DO I = 1, N X(I) = X(I  1) ** 2 ENDDO
exhibits a forward dependency, where successive iterations depend on the previously completed computation. The following loop;
DO I = 1, N X(I) = X(I + 1) ** 2 ENDDO
exhibits backward dependency where successive iterations depend on values ahead of the computation. Unlike vector or shared memory machines, backward dependencies are not a problem for cluster concurrency. Only forward dependencies inhibit concurrency.
In general, the same can be said about output dependency, but life is often not that simple. Consider the following code:
DO I = 1, N Z = X(I)*X(I) Y(I) = Z ENDDO ... = Z
Here the variable Z is used after the DO loop, so it is not local to the loop. This case is easy to fix as the correct value of Z can be simply taken from the last iteration. Now consider the following code.
DO I = 1, N IF(X(I) .EQ. S) THEN Z = Y(I) ENDIF ENDDO
Here the output dependency is important. It is unknown what iteration produces the correct value of variable Z and correspondingly it is unknown what processor has the correct value of the variable when the parallel loop is over. This situation can be corrected when the number of the processor (or process)P containing correct value is saved in IPROC.
IPROC = 0 DO I = 1, N IF(X(I) .EQ. S) THEN Z = Y(I) IPROC = P ENDIF ENDDO
Let's consider more complicated example;
DO I =1, N IF(X(I) .LT. 0) THEN Y(IND(I)) = X(I) ENDIF ENDDO
In this case, it is required to keep an array indicating what values of array Y are changed. This requirement makes the parallel program much more complex. As output dependencies become more complex, the question of overhead becomes important. That is, will tracking these changes slow the code down so that the sequential code will run faster? Control dependencies for loops are statements that break execution of loop. Typically they are not avoidable. For example;
DO I = 1, N IF(X(I) .LE. 0) GOTO 100 ... ENDDO 100 CONTINUE
Concurrent execution of this loop assumes that some iterations after the break can be executed on other processors. But this may cause a fatal error because the program never expects the code to be executed. There are some ways to avoid these faults, but they are typically supported by the hardware.
An other type of loop, shown below, searches for the first element equal to a target value.
DO I = 1, N IF(L(I) .EQ. K) THEN K = I GOTO 100 ENDIF ENDDO
This loop is usually concurrent. The code must be modified to select the first value of K produced by the different nodes.
There are special kinds of loopcarried dependencies that can be made concurrent. When a loop contains an associative operation, it can be executed partially on cluster nodes and then a final operation is performed on the partial results. This task can be done in log N steps using an algorithm that could be called a reverse broadcast. A previous column we showed showed how this was done for the following loop.
S = 0. DO I =1, N S = S + X(I) ENDDO
Other frequently used associative operations are multiplication, minimum, maximum, Boolean or bitwise and, or and some others, and possibly some userdefined operations. Note that matrix operations can be associative as well and parallelized in a similar way. It is important to understand that the results of the summation are not used inside the loop, otherwise dependencies may prevent it from being evaluated concurrently.
A more complicated example is linear recursion. Sometimes data is changed in an iteration and then is used on subsequent iteration, but this data is changed so that it is possible to calculate its value for other iteration numbers. For example, the value K is increased by constant step in the following fragment.
K = N DO I = 1, N X(I) = Y(K) K = K  1 ENDDO
It can be seen that the variable K can be computed directly for each loop iteration (i.e. K= NI+1) and no longer inhibits parallelization. A variable which can be changed from recursive to computed using a loop parameter is called an induction. The above fragment can be made concurrent by rewriting it as follows (P is the current process number and NP is the number of processors.)
N1 = N / NP * P N2 = N / NP * (P + 1) K = N  N1 + 1 DO I = N1, N2 X(I) = Y(K) K = K  1 ENDDO
Similarly, the following case is concurrent if the function F has no side effects (i.e. does not change any other variables).
P = 1. DO I = 1, N P = P * 2. X(I) = F(P) ENDDO
Sometimes a loop is written in such a way that two independent loops form one loop. In this case, splitting the loop may help with parallelization. In some cases, a nested loop may have a concurrent inner loop and outer loop that is not. It is clear that parallelization of outer loop makes parallel program more efficient. In this case, it is desirable to reorder loops so that outer loop becomes parallel.
Recursion and Concurrency
Another commonly used program structure is recursion. Recursion is used less often than loops, but it is still used in some programs. For example, one common sorting algorithms is the SortMerge where the sorting array is split into two approximately equal parts, these parts are sorted by the same algorithm. A sequential SortMerge program shown in the sidebar.This program can be executed in parallel the following way. First, one processor starts by running this subroutine. When SORTMERGE is called again the first "subsort" is placed on a separate processor. Now the first and second processor are working on this subroutine. When they call SORTMERGE again, they each use a second processor and the total number of processors in use now equals four. The process continues until the sorting work is small or you run out of processors. the sidebar show the concurrent version of SortMerge. Because recursion is harder to parallelize and load balance it is generally used less often than loop structures.
In the future columns, we will see how to organize the concurrency we found in these examples for parallel execution. For now, it may be interesting to look over your code and see if it can be made concurrent. Remember, however, Concurrent does not automatically imply parallel!
Sidebar Two: An Example of Recursive Concurrency 
The following subroutine is a sequential implementation of the SortMerge algorithm. (A is the array and N is the number of elements in the array.) SUBROUTINE SORTMERGE(A, N) IF(N .LE. 4) ! For small amount of data CALL BUBBLESORT(A, N) ! a simple sort algorithm is used RETURN ENDIF CALL SORTMERGE(A(1:N/2), N/2) ! Sorting first half of data CALL SORTMERGE(A(N/2+1:N), NN/2) ! Sorting second half of data CALL MERGE(A, A(1:N/2), A(N/2+1:N)) ! Merging data END The concurrent version of the subroutine can be written as follows. (NP is the number of processors.) SUBROUTINE SORTMERGE(A, N) COMMON /ORG/ STEP ! initially STEP =0 IF(P .GT. 0) THEN receive A(N/2+1:N), NN/2, STEP ENDIF IF(N .LE. 4) ! For small amount of data CALL BUBBLESORT(A, N) ! a simple sort algorithm is used RETURN ENDIF STEP = STEP + 1 CALL SORTMERGE(A(1:N/2), N/2) ! Sorting first half of data IF(2**STEP .LT. NP) THEN send A(N/2+1:N), NN/2, STEP to 2**(STEP1) + P receive A(N/2+1:N) ELSE CALL SORTMERGE(A(N/2+1:N), NN/2) !Sorting second half of data ENDIF CALL MERGE(A, A(1:N/2), A(N/2+1:N)) ! Merging data IF(STEP .GT. 1) THEN send A(N/2+1:N) to P2**(STEP2) ENDIF END

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.
Pavel Telegin is currently Department Head of Programming Technology and Methods in the Joint Supercomputer Center, Moscow, Russia.