
It All Depends on the Dependencies

In this article, we continue our series on writing parallel programs with a discussion of how to determine whether a program has concurrent sections by looking at the code. The formal way to do this is to determine flow dependence. In fact, the conditions for concurrency are basically the same as those for reordering instructions: when instructions can be reordered, they can be executed concurrently. Here is a simple example of two Fortran statements that are concurrent.

  X = Y + 1
  Z = Y + 2

There are three basic cases in which statements like this are not concurrent.

  1. The first case is when data produced by one instruction is used by a subsequent instruction. This condition is called data flow dependence (also known as true or forward dependence). For example, in the case below, the statements are not concurrent because the value produced by the first statement is used by the second statement.

      X = Y + 1 
      Z = X + 1 
    

  2. The second case is when data produced by one statement is later overwritten by another, so the wrong data may be used. This condition is called output dependence. For example, in the case below, the assignments are not concurrent because X is assigned in two places; if the first and third statements ran concurrently, it would be indeterminate which value of X the second statement uses.

      X = Y + 1
      Z = 2 * X
      X = Y + 2
    

  3. The third case is when data used by one instruction is overwritten by a subsequent instruction. This condition is called backward dependence (or anti-dependence). For example, in the case below, the statements are not concurrent because the new X value could be computed before its old value is used by the first statement.

      Z = X + 1
      X = Y + 1
    

In truth, only the first case truly inhibits concurrency. Flow dependence is a property of the algorithm. Output and backward dependencies do not depend on the algorithm; they are artifacts of the implementation and can be avoided by renaming variables, as Sidebar One shows.

Sidebar One: Understanding Concurrent and Parallel

As was mentioned in the first column, the terms concurrent and parallel are often used interchangeably. In the context of this column, they are not the same thing. As we have stated, concurrency is a property of a program or algorithm: if parts of the program can run independently, then those parts are concurrent. If the independent parts are run on separate processors, the program is often called parallel.

For example, consider case two above. A concurrent version can be obtained by renaming the variable so that the first and third lines are independent.

  X1 = Y + 1
  Z = 2 * X1
  X = Y + 2

Similarly, example three can be written as follows (where X1 holds a copy of the original value of X).

  Z = X1 + 1
  X = Y + 1 

In some cases on clusters, it is not even necessary to rename variables; you only need to make sure that the value produced last is the one used later in the program. For example, with output dependence, case two can be performed as follows.

process 1 
  X = Y + 1 	
  Z = 2 * X  
  receive X from process 2 

process 2
  X = Y + 2  
  send X to process 1

Similarly, backward dependence could be handled as follows.

process 1
  Z = X + 1
  receive X from process 2

process 2
  X = Y + 1	
  send X to process 1

In addition to data dependencies, there is control dependence. Branches in the program cause this type of dependence. As branches cannot be reordered with other instructions, they cannot be executed in parallel. For example, the following code should not be reordered.

IF(X .GT. 0) THEN
  Y = -X
  Z = X
ELSE
  Y = X
  Z = -X
ENDIF

Of course, reordering IF-ELSE-ENDIF instructions changes the algorithm of the program. One way to make this code execute in parallel is to replicate the statements. The decision point can be rewritten as follows.

process 1
  IF(X .GT. 0) THEN
    Y = -X
  ELSE
    Y = X
  ENDIF
  send/receive

process 2
  IF(X .GT. 0) THEN
    Z = X
  ELSE
    Z = -X
  ENDIF
  send/receive

Note that this particular code, if executed in parallel, would probably be very inefficient because little data is computed and a send/receive would be needed.

File operations are typically not concurrent because each operation modifies the file pointer. Parallel read/write operations require support from the operating system. Furthermore, GOTO and STOP instructions cannot be executed concurrently.

Checking flow dependencies can be applied to algorithms as well, to make sure that a supposedly concurrent algorithm really is concurrent. Analyzing data dependencies can also aid in deciding which parts of a program are concurrent. To make code efficient, it is reasonable to consider sets of instructions instead of single instructions; these sets can then be treated as concurrent units. For example, these two IF-THEN-ENDIF constructions are independent.

IF(T .GE. TS) THEN
  T = 0.
ENDIF
IF(X .GE. XS) THEN
  X = 0.
ENDIF

This code sequence can be written as:

process 1
  IF(T .GE. TS) THEN
    T = 0.
  ENDIF	

process 2
  IF(X .GE. XS) THEN
    X = 0.
  ENDIF

Concurrent Loops

The most important structure for parallelization is the DO loop (or for loop). Loops are widely used because they provide a large number of iterations, and when the iterations are concurrent this can produce a parallel section of code. In addition, it is easier to balance the load of a parallel program using regular objects like loops. The most important thing to check in a loop is whether subsequent iterations depend on data from previous iterations. If they do, the loop is said to have loop-carried dependencies. The distributed memory model used by clusters allows one to avoid conflicts with local data such as the scalar TMP below, since each process gets its own copy. For instance, in the example below, all iterations can be executed concurrently.

 DO I = 1, N
   TMP = X(I) * Y(I)
   Y(I) = TMP - 1.
 ENDDO

As we have pointed out in previous columns, detecting concurrent segments does not mean that they can be executed efficiently in parallel. The overhead caused by data transfer may limit or exceed the gain from parallelization. Consider, for example, the following loop.

 DO I = 1, 1000
   X(I) = X(I) + 1.
 ENDDO

If this loop is executed in parallel, then data transfers may require more time than the loop needs to complete on a single processor.
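To get a feel for the numbers, suppose (purely for illustration) that each addition takes about 10 nanoseconds. The whole loop then finishes in roughly 10 microseconds on one processor, while the latency of a single message on a typical cluster interconnect is itself tens of microseconds. Distributing the array, updating it, and collecting the results would cost more than simply running the loop in place.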


Loops can have the same basic kinds of dependencies: forward (flow), backward, and output dependencies. For instance, the following loop:

DO I = 2, N
  X(I) = X(I - 1) ** 2
ENDDO

exhibits a forward dependency, where successive iterations depend on previously completed computations. The following loop:

DO I = 1, N - 1
  X(I) = X(I + 1) ** 2
ENDDO

exhibits a backward dependency, where successive iterations depend on values ahead of the computation. Unlike on vector or shared memory machines, backward dependencies are not a problem for cluster concurrency; only forward dependencies inhibit concurrency.
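The reason is that each process works on its own copy of the data, so the old values are still available when its block is computed. Here is a minimal sketch, assuming each process handles the block of iterations from N1 to N2 (computed as in the induction example later in this article) and holds a local copy of X:

DO I = N1, N2
  X(I) = X(I + 1) ** 2   ! X(I + 1) is the old value from the local copy
ENDDO

Even while another process overwrites its own copy of X(N2 + 1), this process still reads the old value, which is exactly what the sequential loop would have used.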

In general, the same can be said about output dependency, but life is often not that simple. Consider the following code:

DO I = 1, N
  Z = X(I)*X(I)
  Y(I) = Z
ENDDO
... = Z

Here the variable Z is used after the DO loop, so it is not local to the loop. This case is easy to fix, because the correct value of Z is simply the one from the last iteration; a sketch of the fix appears below.
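A minimal sketch, assuming block distribution of iterations so that the last process (number NP - 1, with process numbers starting at 0) executes iteration N and therefore ends up with the correct Z:

DO I = N1, N2
  Z = X(I)*X(I)
  Y(I) = Z
ENDDO
IF(P .EQ. NP - 1) THEN           ! this process ran iteration N
  send Z to all other processes
ELSE
  receive Z from process NP - 1
ENDIF
... = Z

Now consider the following code.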

DO I = 1, N
  IF(X(I) .EQ. S) THEN
    Z = Y(I)
  ENDIF
ENDDO

Here the output dependency is important. It is unknown which iteration produces the correct value of the variable Z, and correspondingly it is unknown which processor has the correct value of the variable when the parallel loop is over. This situation can be corrected by saving the number P of the process containing the correct value in IPROC.

IPROC = 0
DO I = 1, N
  IF(X(I) .EQ. S) THEN
    Z = Y(I)
    IPROC = P
  ENDIF
ENDDO
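After the loop, the processes must still agree on whose Z is correct. A minimal sketch of one way to finish, assuming iterations are distributed in ascending blocks so that, among the processes that matched, the one with the largest number holds the value from the latest iteration (IPMAX is a name invented here; the maximum can be found with the reduction technique discussed below):

IPMAX = maximum of IPROC over all processes
IF(P .EQ. IPMAX) THEN
  send Z to all other processes
ELSE
  receive Z from process IPMAX
ENDIF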

Let's consider a more complicated example:

DO I = 1, N
  IF(X(I) .LT. 0) THEN
    Y(IND(I)) = X(I)
  ENDIF
ENDDO

In this case, it is required to keep an array indicating which values of array Y were changed, and this requirement makes the parallel program much more complex (a sketch follows). As output dependencies become more complex, the question of overhead becomes important. That is, will tracking these changes slow the code down so much that the sequential code runs faster?
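A minimal sketch of the bookkeeping, assuming each process runs a block of iterations and CHANGED is an invented marker array zeroed before the loop:

DO I = N1, N2
  IF(X(I) .LT. 0) THEN
    Y(IND(I)) = X(I)
    CHANGED(IND(I)) = 1    ! remember which elements were written
  ENDIF
ENDDO
! after the loop, the processes exchange the marked elements
! of Y so that every copy is brought up to date

Control dependencies for loops are statements that break the execution of the loop. Typically they are not avoidable. For example: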

DO I = 1, N
  IF(X(I) .LE. 0) GOTO 100
  ...
ENDDO
100 CONTINUE

Concurrent execution of this loop assumes that some iterations after the break would be executed on other processors. This may cause a fatal error, because the program never expects those iterations to be executed. There are some ways to avoid these faults, but they typically require hardware support.

Another type of loop, shown below, searches for the first element equal to a target value.

DO I = 1, N
  IF(L(I) .EQ. K) THEN
    K = I
    GOTO 100
  ENDIF
ENDDO
100 CONTINUE

This loop is usually concurrent, but the code must be modified to select the first (smallest-index) value of K produced by the different nodes, as sketched below.
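A minimal sketch, where each process scans its own block and K0 is an invented variable holding the local result (N + 1 meaning no match was found):

K0 = N + 1
DO I = N1, N2
  IF(L(I) .EQ. K) THEN
    K0 = I
    GOTO 100
  ENDIF
ENDDO
100 CONTINUE
! the smallest K0 over all processes is the first match;
! it can be found with a minimum reduction and stored in K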

There are special kinds of loop-carried dependencies that can be made concurrent. When a loop contains an associative operation, it can be executed partially on the cluster nodes and then a final operation is performed on the partial results. This task can be done in log N steps using an algorithm that could be called a reverse broadcast. A previous column showed how this was done for the following loop.

S = 0.
DO I = 1, N
  S = S + X(I)
ENDDO
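Here is a minimal sketch of the idea using the conventions of this article (block bounds N1 and N2, process number P starting at 0, NP processes, and send/receive pseudo-operations); it is not necessarily the exact algorithm from the earlier column:

S = 0.                       ! each process sums its own block
DO I = N1, N2
  S = S + X(I)
ENDDO
D = 1                        ! combine partial sums pairwise
DO WHILE (D .LT. NP)         ! takes log2(NP) steps
  IF(MOD(P, 2*D) .EQ. 0 .AND. P + D .LT. NP) THEN
    receive T from process P + D
    S = S + T
  ELSE IF(MOD(P, 2*D) .EQ. D) THEN
    send S to process P - D
  ENDIF
  D = 2 * D
ENDDO                        ! process 0 now holds the total sum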

Other frequently used associative operations are multiplication, minimum, maximum, Boolean or bitwise AND and OR, and possibly some user-defined operations. Note that matrix operations can be associative as well and can be parallelized in a similar way. It is important that the result of the summation is not used inside the loop; otherwise, dependencies may prevent the loop from being evaluated concurrently.

A more complicated example is linear recursion. Sometimes data is changed in one iteration and then used in a subsequent iteration, but it is changed in such a way that its value can be calculated directly for any iteration number. For example, the value K is changed by a constant step in the following fragment.

K = N
DO I = 1, N
  X(I) = Y(K)
  K = K - 1
ENDDO

It can be seen that the variable K can be computed directly for each loop iteration (i.e., K = N - I + 1), so it no longer inhibits parallelization. A variable that can be changed from a recursive form to one computed from the loop parameter is called an induction variable. The above fragment can be made concurrent by rewriting it as follows (P is the current process number, starting at 0, NP is the number of processes, and N is assumed to be divisible by NP).

N1 = N / NP * P + 1      ! first iteration of this process's block
N2 = N / NP * (P + 1)    ! last iteration of this process's block
K = N - N1 + 1           ! K computed directly from N1
DO I = N1, N2
  X(I) = Y(K)
  K = K - 1
ENDDO

Similarly, the following case is concurrent if the function F has no side effects (i.e., it does not change any other variables), since P here is also an induction variable (its value in iteration I is 2. ** I).

P = 1.
DO I = 1, N
  P = P * 2.
  X(I) = F(P)
ENDDO
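A minimal concurrent sketch: the induction is computed directly from the loop index, so each iteration is independent (block bounds N1 and N2 as above):

DO I = N1, N2
  X(I) = F(2. ** I)    ! P after I doublings is 2**I
ENDDO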

Sometimes a loop is written in such a way that two independent loops are fused into one. In this case, splitting the loop may help with parallelization. In other cases, a nested loop may have a concurrent inner loop and an outer loop that is not concurrent. Since parallelizing the outer loop makes the parallel program more efficient, it is desirable to reorder the loops so that the outer loop becomes parallel, as in the sketch below.
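Here is a minimal invented sketch of loop interchange. The dependence runs along J, so the rows indexed by I are independent:

DO J = 2, M                 ! outer loop carries the dependence
  DO I = 1, N               ! inner loop is concurrent
    X(I, J) = X(I, J - 1) + 1.
  ENDDO
ENDDO

After interchanging the loops, the concurrent loop is on the outside, and whole rows can be given to different processes:

DO I = 1, N                 ! outer loop is now concurrent
  DO J = 2, M
    X(I, J) = X(I, J - 1) + 1.
  ENDDO
ENDDO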

Recursion and Concurrency

Another commonly used program structure is recursion. Recursion is used less often than loops, but it still appears in many programs. For example, one common sorting algorithm is Sort-Merge, where the array to be sorted is split into two approximately equal parts and each part is sorted by the same algorithm. A sequential Sort-Merge program is shown in Sidebar Two.

This program can be executed in parallel in the following way. First, one processor starts by running the subroutine. When SORTMERGE is called recursively, the second "sub-sort" is sent to a separate processor. Now the first and second processors are working on the subroutine. When they call SORTMERGE again, each uses another processor, and the total number of processors in use equals four. The process continues until the sorting work is small or you run out of processors. Sidebar Two also shows the concurrent version of Sort-Merge. Because recursion is harder to parallelize and load balance, it is generally used less often than loop structures.

In future columns, we will see how to organize the concurrency we found in these examples for parallel execution. For now, it may be interesting to look over your own code and see whether it can be made concurrent. Remember, however: concurrent does not automatically imply parallel!

Sidebar Two: An Example of Recursive Concurrency

The following subroutine is a sequential implementation of the Sort-Merge algorithm. (A is the array and N is the number of elements in the array.)

SUBROUTINE SORTMERGE(A, N)
IF(N .LE. 4) THEN        ! For a small amount of data
  CALL BUBBLESORT(A, N)  ! a simple sort algorithm is used
  RETURN
ENDIF
CALL SORTMERGE(A(1:N/2), N/2)        ! Sorting first half of data
CALL SORTMERGE(A(N/2+1:N), N-N/2)    ! Sorting second half of data
CALL MERGE(A, A(1:N/2), A(N/2+1:N))  ! Merging data
END

The concurrent version of the subroutine can be written as follows. (NP is the number of processors and P is the current process number.)

SUBROUTINE SORTMERGE(A, N)
COMMON /ORG/ STEP            ! initially STEP =0
IF(P .GT. 0) THEN
  receive A(N/2+1:N), N-N/2, STEP 
ENDIF
IF(N .LE. 4) THEN             ! For a small amount of data
  CALL BUBBLESORT(A, N)       ! a simple sort algorithm is used
  RETURN
ENDIF
STEP = STEP + 1
CALL SORTMERGE(A(1:N/2), N/2)  ! Sorting first half of data
IF(2**STEP .LT. NP) THEN
  send A(N/2+1:N), N-N/2, STEP to 2**(STEP-1) + P
  receive A(N/2+1:N) 
ELSE
  CALL SORTMERGE(A(N/2+1:N), N-N/2)    !Sorting second half of data
ENDIF
CALL MERGE(A, A(1:N/2), A(N/2+1:N))      ! Merging data
IF(STEP .GT. 1) THEN
  send A(N/2+1:N) to P-2**(STEP-2)
ENDIF
END

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Pavel Telegin is currently Department Head of Programming Technology and Methods in the Joint Supercomputer Center, Moscow, Russia.