Loops can have the same basic kinds of dependencies: forward (flow), backward, and output dependencies. For instance, the following loop:

DO I = 1, N
  X(I) = X(I - 1) ** 2
ENDDO

exhibits a forward dependency, where successive iterations depend on the previously completed computation. The following loop:

DO I = 1, N
  X(I) = X(I + 1) ** 2
ENDDO

exhibits a backward dependency, where successive iterations depend on values from iterations that have not yet executed. Unlike on vector or shared memory machines, backward dependencies are not a problem for cluster concurrency; only forward dependencies inhibit concurrency.

In general, the same can be said about output dependencies, but life is often not that simple. Consider the following code:

DO I = 1, N
  Z = X(I)*X(I)
  Y(I) = Z
ENDDO
... = Z

Here the variable Z is used after the DO loop, so it is not local to the loop. This case is easy to fix, because the correct value of Z is simply the one produced by the last iteration, as sketched below.
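
A minimal sketch of the fix, assuming MPI is used, Z is REAL, and iterations are distributed in blocks, with each process working on I = N1, N2 and the highest-ranked process (NP-1) owning iteration N (the names N1, N2, P, and NP follow the conventions used later in this article):

DO I = N1, N2
  Z = X(I) * X(I)
  Y(I) = Z
ENDDO
! Only the final value of Z matters after the loop, so the process
! that executed iteration N broadcasts its Z to everyone else.
CALL MPI_BCAST(Z, 1, MPI_REAL, NP - 1, MPI_COMM_WORLD, IERR)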

Now consider the following code.

DO I = 1, N
  IF(X(I) .EQ. S) THEN
    Z = Y(I)
  ENDIF
ENDDO

Here the output dependency is important: it is unknown which iteration produces the correct value of the variable Z, and correspondingly it is unknown which processor holds the correct value when the parallel loop is over. The situation can be corrected by saving the number of the processor (or process), P, that holds the correct value in the variable IPROC.

IPROC = 0
DO I = 1, N
  IF(X(I) .EQ. S) THEN
    Z = Y(I)
    IPROC = P
  ENDIF
ENDDO
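
After the loop, the processes still have to agree on whose Z is correct. A minimal sketch of that step, assuming MPI, a block distribution of iterations, and IPROC initialized to -1 instead of 0 (so that a match on process 0 can be distinguished from no match at all); IOWNER is a hypothetical temporary:

! With a block distribution, the highest-ranked matching process holds
! the value from the latest iteration, which is exactly the value the
! serial loop would have kept.
CALL MPI_ALLREDUCE(IPROC, IOWNER, 1, MPI_INTEGER, MPI_MAX, &
                   MPI_COMM_WORLD, IERR)
IF(IOWNER .GE. 0) THEN
  CALL MPI_BCAST(Z, 1, MPI_REAL, IOWNER, MPI_COMM_WORLD, IERR)
ENDIF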

Let's consider a more complicated example:

DO I = 1, N
  IF(X(I) .LT. 0) THEN
    Y(IND(I)) = X(I)
  ENDIF
ENDDO

In this case, it is required to keep an array recording which elements of Y are changed, and this requirement makes the parallel program much more complex. As output dependencies become more complex, the question of overhead becomes important: will tracking these changes slow the code down so much that the sequential code runs faster? A sketch of the tracking approach follows.
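
A minimal sketch of the idea, assuming a block distribution of iterations (I = N1, N2 on each process) and a hypothetical helper array CHANGED; the final exchange of updated elements is only indicated by a comment, since that exchange is exactly the overhead in question:

LOGICAL CHANGED(N)
DO I = 1, N
  CHANGED(I) = .FALSE.
ENDDO
DO I = N1, N2
  IF(X(I) .LT. 0) THEN
    Y(IND(I)) = X(I)
    CHANGED(IND(I)) = .TRUE.   ! remember which element was modified
  ENDIF
ENDDO
! After the loop, each process must send its changed (index, value)
! pairs to the others so that every copy of Y becomes consistent.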

Control dependencies in loops are statements that break out of the loop. Typically they cannot be avoided. For example:

DO I = 1, N
  IF(X(I) .LE. 0) GOTO 100
  ...
ENDDO
100 CONTINUE

Concurrent execution of this loop assumes that some iterations after the break may be executed on other processors. This can cause a fatal error, because the program never expects those iterations to be executed. There are ways to avoid such faults, but they typically require hardware support.

Another type of loop, shown below, searches for the first element equal to a target value.

DO I = 1, N
  IF(L(I) .EQ. K) THEN
    K = I
    GOTO 100
  ENDIF
ENDDO

This loop is essentially concurrent: each node can search its own part of the array, but the code must be modified to select the first (smallest-index) value of K produced by the different nodes.
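
A minimal sketch of that modification, assuming MPI and a block distribution of iterations (I = N1, N2 on each process); KLOC is a hypothetical local variable set to N+1 when a process finds no match, so the global minimum selects the first match overall:

KLOC = N + 1
DO I = N1, N2
  IF(L(I) .EQ. K) THEN
    KLOC = I
    GOTO 100
  ENDIF
ENDDO
100 CONTINUE
! The smallest matching index across all processes is the answer.
CALL MPI_ALLREDUCE(KLOC, K, 1, MPI_INTEGER, MPI_MIN, &
                   MPI_COMM_WORLD, IERR)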

There are special kinds of loop-carried dependencies that can be made concurrent. When a loop contains an associative operation, it can be executed partially on cluster nodes and then a final operation is performed on the partial results. The final combination can be done in log NP steps (where NP is the number of nodes) using an algorithm that could be called a reverse broadcast. A previous column showed how this was done for the following loop.

S = 0.
DO I = 1, N
  S = S + X(I)
ENDDO
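
A minimal sketch of the parallel version, assuming MPI and a block distribution of iterations; each process sums its own block, and MPI_ALLREDUCE performs the log-step combining of partial sums described above (STOT is a hypothetical temporary):

S = 0.
DO I = N1, N2
  S = S + X(I)
ENDDO
! Combine the NP partial sums into the global sum on every process.
CALL MPI_ALLREDUCE(S, STOT, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, IERR)
S = STOT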

Other frequently used associative operations are multiplication, minimum, maximum, Boolean or bitwise AND and OR, and possibly some user-defined operations. Note that matrix operations can be associative as well and can be parallelized in a similar way. It is important that the result of the summation not be used inside the loop; otherwise, dependencies may prevent the loop from being evaluated concurrently.

A more complicated example is linear recursion. Sometimes a value is changed in one iteration and then used in a subsequent iteration, but it is changed in such a regular way that its value can be computed directly for any iteration number. For example, the value K is changed by a constant step in the following fragment.

K = N
DO I = 1, N
  X(I) = Y(K)
  K = K - 1
ENDDO

It can be seen that the variable K can be computed directly for each loop iteration (i.e., K = N - I + 1), so it no longer inhibits parallelization. A variable which can be changed from recursive to computed from the loop parameter is called an induction variable. The above fragment can be made concurrent by rewriting it as follows (P is the current process number, counted from 0, NP is the number of processes, and for simplicity N is assumed to be divisible by NP).

N1 = N / NP * P + 1     ! first iteration of this process's block
N2 = N / NP * (P + 1)   ! last iteration of this process's block
K = N - N1 + 1          ! value K would have at iteration N1
DO I = N1, N2
  X(I) = Y(K)
  K = K - 1
ENDDO

Similarly, the following case is concurrent if the function F has no side effects (i.e., it does not change any other variables).

P = 1.
DO I = 1, N
  P = P * 2.
  X(I) = F(P)
ENDDO
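
Here P is again an induction variable: at iteration I it always equals 2**I, so each process can compute it directly for its own block. A minimal sketch under the same block distribution as above:

DO I = N1, N2
  X(I) = F(2. ** I)   ! compute P = 2**I directly from the loop index
ENDDO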

Sometimes a loop is written in such a way that two independent loops are merged into one. In this case, splitting the loop may help with parallelization. In other cases, a nested loop may have a concurrent inner loop and a non-concurrent outer loop. Since parallelizing the outer loop makes the parallel program more efficient, it is desirable to reorder the loops so that the outer loop becomes the parallel one. Both transformations are sketched below.
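
A minimal illustration of both ideas (the arrays and bounds are made up for the example). First, loop splitting: the single loop below mixes an independent update of A with a recurrence on B, and splitting it isolates the concurrent part.

DO I = 2, N
  A(I) = A(I) * 2.        ! independent -- could run concurrently
  B(I) = B(I-1) + A(I)    ! recurrence  -- must run sequentially
ENDDO

After splitting, the first loop can be distributed across nodes while the second remains sequential:

DO I = 2, N
  A(I) = A(I) * 2.
ENDDO
DO I = 2, N
  B(I) = B(I-1) + A(I)
ENDDO

Second, loop reordering: below, the inner I loop is concurrent but the outer J loop carries a recurrence, so interchanging them puts the parallel loop on the outside.

DO J = 2, M
  DO I = 1, N
    X(I,J) = X(I,J-1) + 1.   ! recurrence on J, independent in I
  ENDDO
ENDDO

DO I = 1, N                  ! after interchange the outer loop is parallel
  DO J = 2, M
    X(I,J) = X(I,J-1) + 1.
  ENDDO
ENDDO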

Recursion and Concurrency

Another commonly used program structure is recursion. Recursion is used less often than loops, but it still appears in some programs. For example, one common sorting algorithm is Sort-Merge, in which the array to be sorted is split into two approximately equal parts and each part is sorted by the same algorithm. A sequential Sort-Merge program is shown in the sidebar.

This program can be executed in parallel in the following way. One processor starts by running the subroutine. When SORTMERGE is called recursively, one of the two "sub-sorts" is placed on a separate processor, so the first and second processors are both working on the subroutine. When they call SORTMERGE again, each uses another processor, and the total number of processors in use becomes four. The process continues until the sorting work becomes small or you run out of processors. The sidebar shows the concurrent version of Sort-Merge. Because recursion is harder to parallelize and load balance, it is generally used less often than loop structures.

In future columns, we will see how to organize the concurrency we found in these examples for parallel execution. For now, it may be interesting to look over your code and see if it can be made concurrent. Remember, however: concurrent does not automatically imply parallel!

Sidebar Two: An Example of Recursive Concurrency

The following subroutine is a sequential implementation of the Sort-Merge algorithm. (A is the array and N is the number of elements in the array.)

SUBROUTINE SORTMERGE(A, N)
IF(N .LE. 4) THEN        ! For small amount of data
  CALL BUBBLESORT(A, N)  ! a simple sort algorithm is used
  RETURN
ENDIF
CALL SORTMERGE(A(1:N/2), N/2)        ! Sorting first half of data
CALL SORTMERGE(A(N/2+1:N), N-N/2)    ! Sorting second half of data
CALL MERGE(A, A(1:N/2), A(N/2+1:N))  ! Merging data
END

The concurrent version of the subroutine can be written as follows. (NP is the number of processors, P is the number of the current process with process 0 starting the sort, and send/receive denote message-passing operations.)

SUBROUTINE SORTMERGE(A, N)
COMMON /ORG/ STEP                      ! initially STEP = 0 on every process
MYSTEP = 0
IF(P .GT. 0 .AND. STEP .EQ. 0) THEN    ! top-level call on a spawned process:
  receive A(1:N), N, STEP              ! get the data, its size, and the
  MYSTEP = STEP                        ! step at which we were spawned
ENDIF
IF(N .LE. 4) THEN                      ! For small amount of data
  CALL BUBBLESORT(A, N)                ! a simple sort algorithm is used
ELSE
  STEP = STEP + 1
  IF(2**STEP .LT. NP) THEN             ! a free processor is available
    send A(N/2+1:N), N-N/2, STEP to 2**(STEP-1) + P
    CALL SORTMERGE(A(1:N/2), N/2)      ! sort first half locally while the
    receive A(N/2+1:N)                 ! partner sorts the second half
  ELSE
    CALL SORTMERGE(A(1:N/2), N/2)      ! Sorting first half of data
    CALL SORTMERGE(A(N/2+1:N), N-N/2)  ! Sorting second half of data
  ENDIF
  CALL MERGE(A, A(1:N/2), A(N/2+1:N))  ! Merging data
ENDIF
IF(MYSTEP .GT. 0) THEN                 ! a spawned process returns its sorted
  send A(1:N) to P - 2**(MYSTEP-1)     ! data to the parent that spawned it
ENDIF
END

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Pavel Telegin is currently Department Head of Programming Technology and Methods in the Joint Supercomputer Center, Moscow, Russia.
