Want to parallelize your code? Before you dive in, you might want to test the waters.

It can be said that writing parallel code is easy. It can also be said that getting your code to run fast and produce correct answers is a bit more difficult. With parallel computing, we don't have the luxury of a compiler optimizing for us, so we have to do the work. In this column we are going to look at some of the fundamentals and hopefully get you thinking about some of the issues that are critical for your success.

Clusters are organized as a set of independent computers connected by a communication network. The difference between clusters and multiprocessor (SMP) systems is the memory structure. Multiprocessors contain several CPUs that are connected in some way to memory. They can be connected by a bus or a crossbar, but the important point is that all processors are connected to all the memory. This configuration is called shared memory. With a cluster, each node has its own memory that is local to that node. Other nodes can only access this memory through the communication network. This configuration is called distributed memory, and it is what makes programming a cluster different from programming a shared memory computer. Accessing the memory of other nodes incurs a substantial time delay, and because of this delay, a program that runs well on a multiprocessor will not necessarily run well on a cluster. Of course, each node can be, and often is, a multiprocessor, but the key to cluster programming is programming distributed memory systems, because the limiting factor is the communication speed.

Regardless of processor or interconnect, all clusters share a common programming model based on distributed memory. The goal of this column is to discuss how one writes efficient programs for clusters. We will not cover issues specific to shared memory systems; that information is available from other sources.

After a cluster is assembled and the software installed, it is time to run application programs on it. Of course, the cluster was created to make application programs run fast. When you run an existing sequential program on your cluster, you will find that it usually runs only as fast as it runs on a single cluster node. You can load all cluster nodes with a copy of the program and exploit many nodes to solve many similar problems at one time. In some cases, however, a single program may take a long time to run, and you would like to use the full power of your cluster to work on it. In this case, you need your program to be parallel.

Looking at the Code

In an ordinary sequential program, all statements are executed one after another. In a parallel program, many concurrent statements are executed at the same time. The programmer must be sure of two things with such a program. First, concurrent statements must be independent of each other. Second, they must have the correct data to process. If you already have a parallel version of your program, and it runs much faster on your cluster, you are lucky. Programs with concurrency do not always run faster on a cluster (see sidebar). Fortunately, this is not the end of the world (or your career), but it does mean that you have to take a closer look at your program.
Sidebar: Concurrent and Parallel

The terms concurrent and parallel are often used interchangeably. In the context of this column, they are not the same thing. Concurrency is a property of a program or algorithm. If parts of the program can run independently, then these parts are concurrent. If the independent parts are run on separate processors, then the program is often called parallel. The distinction is subtle, but very important when real hardware is used. Since the goal is to make your program run faster, we need to ask the question: does making all the concurrent parts of the program parallel increase execution speed? Unfortunately, the answer is maybe, because in some cases running concurrent parts of your program in parallel may actually slow it down! From the above discussion we can make the following general statements about cluster programming:

  1. Concurrency does not necessarily imply parallel execution.
  2. Concurrency is a property of the program.
  3. Efficient parallel execution of concurrent programs is a property of the hardware.

If you remember the above three rules, you will have an easier time navigating in the cluster world.

Programmers therefore often find themselves in one of two situations.

Situation 1: You have an existing sequential program.
Situation 2: You have a parallel program, but it does not run fast enough.

In the first case, you need to make your program run on several nodes, i.e. parallelize it. After you parallelize your program, you may find yourself in the second case, because the expectation that "parallel means faster" may not hold. Or, perhaps before you parallelize your program, you would like to investigate the efficiency of running it in parallel.

Before You Start Slicing

The first step to understanding efficiency is determining which operations can be done concurrently. For example, a program may have several different independent actions that can be executed in parallel. This is called block parallelism. Block parallelism appears when there are sections (blocks) of independent code; the blocks can be subroutines, statements containing function calls, or loops. For clusters, such parallelism requires large blocks, as small blocks or single operations will not be efficient. Consider the following C code snippet.

s1(x);
for(i=0; i<n; ++i)
  y[i] = y[i+1];
z = s3(z);

In the case where all blocks are independent (concurrent), this can be executed in parallel as follows.

Node 1

s1(x);

Node 2

for(i=0; i<n; ++i)
  y[i] = y[i+1];

Node 3

z = s3(z);

Block parallelism usually arises from the nature of the problem and can be the most evident type of parallelism. However, unless it is recursive, block parallelism is rarely used in parallel programs, for reasons we will discuss later. The most common kind of parallelism is loop parallelism. Consider the following C program loop.

for(i=0; i<1000; ++i)
  x[i] = x[i] + 1.;

The loop can be split and executed concurrently as follows.

Processor 1

  for(i=0; i<250; ++i)
    x[i] = x[i] + 1.;

Processor 2

  for(i=250; i<500; ++i)
    x[i] = x[i] + 1.;

Processor 3

  for(i=500; i<750; ++i)
    x[i] = x[i] + 1.;

Processor 4

  for(i=750; i<1000; ++i)
    x[i] = x[i] + 1.;

Loop parallelism is very popular in parallel programming because it distributes the workload evenly across cluster nodes. In the above example, the four processors have the same amount of work. Of course, you cannot just split your program's loops across different nodes and run your code. Nodes will probably require data they do not have: data produced on one node will probably be needed on another node to produce results. Were this not the case, we would not have a single program, but a collection of independent programs! When data is needed elsewhere, it is necessary to send it from one node to another, or from one node to many nodes. In the program, this communication can be described implicitly or explicitly, depending on what tools you use to make your program parallel. Very often data transfer is done as explicit message passing, which is performed by special libraries like PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). You can learn more about MPI in the MPI Monkey column and more about PVM from the Getting Started With Clusters column. Implicit data transfer can be done by using automatic tools, which we will discuss at a later time.
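
To make the idea concrete, here is a minimal MPI sketch of the earlier x[i] = x[i] + 1. loop with the communication written out explicitly. It is only an illustration, not a recipe: it assumes N divides evenly by the number of nodes and omits all error checking.

#include <stdio.h>
#include <mpi.h>

#define N 1000                 /* total loop length; assumed divisible by the number of nodes */

int main(int argc, char **argv)
{
    double x[N];               /* full array, filled on node 0 */
    double local[N];           /* each node's piece (only the first N/size elements are used) */
    int i, rank, size, chunk;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)             /* node 0 owns the initial data */
        for (i = 0; i < N; ++i)
            x[i] = (double) i;

    chunk = N / size;          /* iterations handled by each node */

    /* explicit message passing: give each node its part of x */
    MPI_Scatter(x, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < chunk; ++i)    /* the original loop, over a smaller range */
        local[i] = local[i] + 1.;

    /* explicit message passing: collect the results back on node 0 */
    MPI_Gather(local, chunk, MPI_DOUBLE, x, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("x[0] = %f  x[N-1] = %f\n", x[0], x[N-1]);

    MPI_Finalize();
    return 0;
}

The MPI_Scatter and MPI_Gather calls are the explicit message passing; everything between them is the same loop as before, just over a smaller range.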

Facing Reality

Message passing, whether implicit or explicit, always consumes time. When messages are frequent or large, they can easily consume more time than is saved by parallel execution. For this reason, it is important to know how much time is saved by parallel execution and how much time is lost performing message passing. Consider the following FORTRAN program.

      READ *, X
      S = 0. 
      DO I = 1, N
        S=S + X(I)
      ENDDO
      PRINT *, S

The summation in this program is concurrent and can be made to execute in parallel. Let's consider the efficiency when it is executed on a 2-node cluster.

Node 1

      READ *, X
      send X(N/2 + 1: N) to Node 2 
      S1 = 0. 
      DO I = 1, N/2
        S1=S1 + X(I)
      ENDDO	
      receive S2 from Node 2
      S = S1 + S2
      PRINT *, S

Node 2

      receive X(N/2 + 1: N) from Node 1
      S2 = 0.
      DO I = N/2+1, N
        S2=S2 + X(I)
      ENDDO
      send S2 to Node 1

Note that send and receive are not elements of the language; they just mean that data is sent to (or received from) another node. In practice, for most systems the transfer time of one word through the network is larger than the time of an ordinary operation, and the transfer of the first word will take much longer still! With a 1 GHz processor and a Gigabit Ethernet connection we can assume approximately the following times.

S1=S1 + X(I)                      -   1 ns (nanosecond)
transfer of one element of X(I)   -   5 ns (nanoseconds)

When N = 1,000,000 the time of sequential execution of the DO loop is about 1 ms (millisecond). The time of parallel execution of the loop is half the sequential time, or 0.5 ms. However, the time to transfer 500,000 words will be no less than 2.5 ms, giving a total execution time of about 3 ms. So the execution time of the parallel version is 3 times longer than the sequential one! If we had a smaller DO loop, say N = 1000, the difference would be even more dramatic.
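
This arithmetic is worth spelling out, because it is exactly the kind of back-of-the-envelope estimate you should make before parallelizing anything. The short C sketch below simply encodes the assumed 1 ns operation time and 5 ns per-word transfer time from the table above; the numbers are rough assumptions, not measurements.

#include <stdio.h>

int main(void)
{
    double t_op   = 1e-9;      /* assumed time of one addition S = S + X(I): 1 ns */
    double t_word = 5e-9;      /* assumed time to transfer one element of X: 5 ns */
    long   n      = 1000000;   /* loop length */

    double t_seq = n * t_op;                          /* the whole DO loop on one node */
    double t_par = n / 2 * t_op + n / 2 * t_word;     /* half the additions plus sending half of X */

    printf("sequential: %g ms   parallel: %g ms\n", t_seq * 1e3, t_par * 1e3);
    return 0;
}

It prints roughly 1 ms for the sequential loop and 3 ms for the parallel version, which is the whole story: for this loop the communication dwarfs the computation.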

All is not lost, however. Let's consider the following C program example:

s = homer(a) + homer(b);
mp = 4;
for(int m=1; m<=n-1; m++)
{
  s = s + mp*homer(a + h*m);
  if(mp == 4)
    mp = 2;
  else
    mp = 4;
}
simpson = s*h/3;

In this case, a 2-node parallel program looks like the following (assume the number of iterations given to each node is even, so that the alternating 4/2 weights start correctly on Node 2).

Node 1

s1 = homer(a) + homer(b);
mp = 4;
for(int m=1; m<=(n-1)/2; m++)
{
  s1 = s1 + mp*homer(a + h*m);
  if(mp == 4) 
    mp = 2;
  else
    mp = 4;
}
receive s2 from Node 2
s = s1 + s2;
simpson = s*h/3;

Node 2

s2 = 0;
mp = 4;
for(int m=(n-1)/2+1; m<=n-1; m++)
{
  s2 = s2 + mp*homer(a+h*m);
  if(mp == 4) 
    mp = 2;
  else
    mp = 4;
}
send s2 to Node 1

Here we have only one transfer of one word. Using Gigabit Ethernet, this will require 20 to 50 μs (microseconds) depending on the implementation; let's assume it is 30 μs. Furthermore, we will assume that one call to the function homer can be executed in 100 ns, so we can neglect the time for the rest of the loop. In this case, with only 1000 iterations we have a sequential time for the loop of 100 μs. Using the same process as we did above, the parallel time will be about 80 μs. This is a small improvement, but an improvement nonetheless. With 1,000,000 iterations, the performance of the parallel program is practically doubled, whereas the previous example is inefficient for any number of iterations.
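
Where do figures like 100 ns per call or 30 μs per message come from? In practice, you measure them. If you are already using MPI, one simple way is to bracket the section of interest with MPI_Wtime, as in the minimal sketch below; the loop here is only a placeholder for whatever piece of your program you want to time.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int i, rank;
    double sum = 0., t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();                /* wall-clock time before the section */
    for (i = 0; i < 1000000; ++i)    /* placeholder for the code you care about */
        sum = sum + (double) i;
    t1 = MPI_Wtime();                /* wall-clock time after the section */

    printf("node %d: section took %f seconds (sum = %g)\n", rank, t1 - t0, sum);

    MPI_Finalize();
    return 0;
}

Putting the same brackets around a send and receive pair gives you the communication times needed for estimates like the ones above.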

These examples show that it is not sufficient to blindly parallelize programs. It is important to place code on nodes only if doing so increases performance, i.e. concurrent portions of the program need to be scheduled based on the underlying hardware. Consider how the above examples would change if the communication layer were faster. Therefore, to understand how to improve efficiency, it is desirable to know the execution times of the important parts of the program, i.e. to profile it. In future columns we will discuss ways to determine these numbers, use execution models, develop basic algorithms, learn ways to express parallelism in different languages, examine tools for parallel conversion, and much more.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Pavel Telegin is currently Department Head of Programming Technology and Methods in the Joint Supercomputer Center, Moscow, Russia.