It All Depends on the Dependencies
In this article, we continue our series on writing parallel programs with a discussion on how to determine if a program has concurrent sections by looking at the code. The formal way to do this is to determine flow dependence. In fact, conditions for concurrency are basically the same as for reordering instructions. When instructions can be reordered, they can be executed concurrently. Here is simple example of two Fortran statements that are concurrent.
X = Y + 1 Z = Y + 2
There are three basic cases when statements like this are not current.
X = Y + 1 Z = X + 1
X = Y + 1 Z = 2 * X X = Y + 2
Z = X + 1 X = Y + 1
In truth, only the first case truly inhibits concurrency. Flow dependence is a property of the algorithm. Output and backward dependencies do not depend on the algorithm and are due to the implementation method. These types of dependencies can be avoided by renaming variables.
Sidebar One: Understanding Concurrent and Parallel 
As was mentioned in the first column, the terms concurrent and parallel are often used interchangeably. In the context of this column, they are not the same thing. As we state, concurrency is a property of a program or algorithm. If parts of the program can run independently, then these parts are concurrent. If the independent parts are run on separate processors then the program is often called parallel. 
For example, consider case two above. A concurrent version, where both the first and third lines are independent.
X1 = Y + 1 Z = 2 * X1 X = Y + 2
Similarly, example three can be written as follow.
Z = X1 + 1 X = Y + 1
In some cases, using clusters it is not even necessary to rename variables. You only need to make sure that the data produced later is used in the program. For example, with output dependence, case two can be performed as the following.
process 1 X = Y + 1 Z = 2 * X receive X from process 2 process 2 X = Y + 2 send X to process 1
Similarly, backward dependence could be handled as follows.
process 1 Z = X + 1 receive X from process 2 process 2 X = Y + 1 send X to process 1
In addition to data dependencies, there is control dependence. Branches in the program cause this type of dependence. As branches cannot be reordered with other instructions, they cannot be executed in parallel. For example, the following code should not be reordered.
IF(X. GT. 0) THEN Y = X Z = X ELSE Y = X Z= X ENDIF
Of course, reordering IFELSEENDIF instructions changes the algorithm of the program. One ways to make this execute in parallel is to replicate the statements. The decision point can be rewritten as follows. {mosgoogle right}
process 1 IF(X. GT. 0) THEN Y = X ELSE Y = X ENDIF send/receive process 2 IF(X. GT. 0) THEN Z = X ELSE Z= X ENDIF send/receive
Note that this particular code, if executed in parallel would probably be very inefficient because little data is computed and a send/receive would be needed.
File operations typically are not concurrent because they use modified file pointers. Parallel read/write operations require support from the operating system. Furthermore, GOTO's and STOP instructions cannot be executed concurrently.
Checking flow dependencies can be used for algorithms as well to make sure that a concurrent algorithm is really concurrent. Analyzing data dependencies can also aid in the decision of what parts of program are concurrent. To make code efficient it is reasonable to consider sets of instructions instead of single instructions. these sets can then be considered concurrent. For example these two IFTHENENDIF constructions are independent.
IF(T .GE. TS) THEN T = 0. ENDIF IF(X .GE. XS) THEN X = 0. ENDIF
This code sequence can be written as;
process 1 IF(T .GE. TS) THEN T = 0. ENDIF process 2 IF(X .GE. XS) THEN X = 0. ENDIF
DO I = 1, N TMP = X(I) * Y(I) Y(I) = TMP  1. ENDDO
As we have pointed out in previous columns, when concurrent segments are detected it does not mean that they can be efficiently executed in parallel. Overhead caused by data transfer may limit or exceed the gain from parallelization. Consider for example the following loop.
DO I = 1, 1000 X(I) = X(I) + 1. ENDDO
If this loop is executed in parallel, then data transfers may require more time than the loop needs to complete on a single processor.
Loops can have the same basic kinds of dependencies; forward (flow), backward, and output dependencies. For instance the following loop;
DO I = 1, N X(I) = X(I  1) ** 2 ENDDO
exhibits a forward dependency, where successive iterations depend on the previously completed computation. The following loop;
DO I = 1, N X(I) = X(I + 1) ** 2 ENDDO
exhibits backward dependency where successive iterations depend on values ahead of the computation. Unlike vector or shared memory machines, backward dependencies are not a problem for cluster concurrency. Only forward dependencies inhibit concurrency.
In general, the same can be said about output dependency, but life is often not that simple. Consider the following code:
DO I = 1, N Z = X(I)*X(I) Y(I) = Z ENDDO ... = Z
Here the variable Z is used after the DO loop, so it is not local to the loop. This case is easy to fix as the correct value of Z can be simply taken from the last iteration. Now consider the following code.
DO I = 1, N IF(X(I) .EQ. S) THEN Z = Y(I) ENDIF ENDDO
Here the output dependency is important. It is unknown what iteration produces the correct value of variable Z and correspondingly it is unknown what processor has the correct value of the variable when the parallel loop is over. This situation can be corrected when the number of the processor (or process)P containing correct value is saved in IPROC.
IPROC = 0 DO I = 1, N IF(X(I) .EQ. S) THEN Z = Y(I) IPROC = P ENDIF ENDDO
Let's consider more complicated example;
DO I =1, N IF(X(I) .LT. 0) THEN Y(IND(I)) = X(I) ENDIF ENDDO
In this case, it is required to keep an array indicating what values of array Y are changed. This requirement makes the parallel program much more complex. As output dependencies become more complex, the question of overhead becomes important. That is, will tracking these changes slow the code down so that the sequential code will run faster? Control dependencies for loops are statements that break execution of loop. Typically they are not avoidable. For example;
DO I = 1, N IF(X(I) .LE. 0) GOTO 100 ... ENDDO 100 CONTINUE
Concurrent execution of this loop assumes that some iterations after the break can be executed on other processors. But this may cause a fatal error because the program never expects the code to be executed. There are some ways to avoid these faults, but they are typically supported by the hardware.
An other type of loop, shown below, searches for the first element equal to a target value.
DO I = 1, N IF(L(I) .EQ. K) THEN K = I GOTO 100 ENDIF ENDDO
This loop is usually concurrent. The code must be modified to select the first value of K produced by the different nodes.
There are special kinds of loopcarried dependencies that can be made concurrent. When a loop contains an associative operation, it can be executed partially on cluster nodes and then a final operation is performed on the partial results. This task can be done in log N steps using an algorithm that could be called a reverse broadcast. A previous column we showed showed how this was done for the following loop.
S = 0. DO I =1, N S = S + X(I) ENDDO
Other frequently used associative operations are multiplication, minimum, maximum, Boolean or bitwise and, or and some others, and possibly some userdefined operations. Note that matrix operations can be associative as well and parallelized in a similar way. It is important to understand that the results of the summation are not used inside the loop, otherwise dependencies may prevent it from being evaluated concurrently.
A more complicated example is linear recursion. Sometimes data is changed in an iteration and then is used on subsequent iteration, but this data is changed so that it is possible to calculate its value for other iteration numbers. For example, the value K is increased by constant step in the following fragment.
K = N DO I = 1, N X(I) = Y(K) K = K  1 ENDDO
It can be seen that the variable K can be computed directly for each loop iteration (i.e. K= NI+1) and no longer inhibits parallelization. A variable which can be changed from recursive to computed using a loop parameter is called an induction. The above fragment can be made concurrent by rewriting it as follows (P is the current process number and NP is the number of processors.) {mosgoogle right}
N1 = N / NP * P N2 = N / NP * (P + 1) K = N  N1 + 1 DO I = N1, N2 X(I) = Y(K) K = K  1 ENDDO
Similarly, the following case is concurrent if the function F has no side effects (i.e. does not change any other variables).
P = 1. DO I = 1, N P = P * 2. X(I) = F(P) ENDDO
Sometimes a loop is written in such a way that two independent loops form one loop. In this case, splitting the loop may help with parallelization. In some cases, a nested loop may have a concurrent inner loop and outer loop that is not. It is clear that parallelization of outer loop makes parallel program more efficient. In this case, it is desirable to reorder loops so that outer loop becomes parallel.
This program can be executed in parallel the following way. First, one processor starts by running this subroutine. When SORTMERGE is called again the first "subsort" is placed on a separate processor. Now the first and second processor are working on this subroutine. When they call SORTMERGE again, they each use a second processor and the total number of processors in use now equals four. The process continues until the sorting work is small or you run out of processors. the sidebar show the concurrent version of SortMerge. Because recursion is harder to parallelize and load balance it is generally used less often than loop structures.
In the future columns, we will see how to organize the concurrency we found in these examples for parallel execution. For now, it may be interesting to look over your code and see if it can be made concurrent. Remember, however, Concurrent does not automatically imply parallel!
Sidebar Two: An Example of Recursive Concurrency 
The following subroutine is a sequential implementation of the SortMerge algorithm. (A is the array and N is the number of elements in the array.) SUBROUTINE SORTMERGE(A, N) IF(N .LE. 4) ! For small amount of data CALL BUBBLESORT(A, N) ! a simple sort algorithm is used RETURN ENDIF CALL SORTMERGE(A(1:N/2), N/2) ! Sorting first half of data CALL SORTMERGE(A(N/2+1:N), NN/2) ! Sorting second half of data CALL MERGE(A, A(1:N/2), A(N/2+1:N)) ! Merging data END The concurrent version of the subroutine can be written as follows. (NP is the number of processors.) SUBROUTINE SORTMERGE(A, N) COMMON /ORG/ STEP ! initially STEP =0 IF(P .GT. 0) THEN receive A(N/2+1:N), NN/2, STEP ENDIF IF(N .LE. 4) ! For small amount of data CALL BUBBLESORT(A, N) ! a simple sort algorithm is used RETURN ENDIF STEP = STEP + 1 CALL SORTMERGE(A(1:N/2), N/2) ! Sorting first half of data IF(2**STEP .LT. NP) THEN send A(N/2+1:N), NN/2, STEP to 2**(STEP1) + P receive A(N/2+1:N) ELSE CALL SORTMERGE(A(N/2+1:N), NN/2) !Sorting second half of data ENDIF CALL MERGE(A, A(1:N/2), A(N/2+1:N)) ! Merging data IF(STEP .GT. 1) THEN send A(N/2+1:N) to P2**(STEP2) ENDIF END

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.
Pavel Telegin is currently Department Head of Programming Technology and Methods in the Joint Supercomputer Center, Moscow, Russia.