Static or Dynamic? It is a matter of balance
Now that we know how to identify the parallel parts of a program, the question is what to do with this knowledge, or in other words, how do you write a parallel program? To answer this question, we will discuss what the structure of a parallel program may look like. Programs can be organized in different ways. We already discussed the SPMD (Single Program Multiple Data) and MPMD (Multiple Programs Multiple Data) models. SPMD and MPMD represent the way a program looks from the point of view of the cluster. Note that with MPI, an MPMD model requires an "app" or "procgroup" file to start different programs on the cluster nodes. Let's see what programs look like from the implementation standpoint.
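(As an aside, many MPI implementations also accept a colon-separated command line in place of such a file, e.g. `mpiexec -n 1 ./master : -n 3 ./worker` to start one master process and three workers as a single MPMD job; the executable names here are hypothetical.)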
There are a variety of ways to structure parallel codes. In this article we explore three common models that have been used successfully in tools like BERT 77, a Fortran conversion tool. As we know, a parallel program may contain several concurrent parts. When making these parts parallel, one may find that the processor load varies quite a bit. We will ignore data transfers and look only at the efficiency of the parallel code, i.e., the load balance.
An Example
Let's explore an example:

```
      CALL S1(X)
      DO I = 1, N
         CALL S2(Y(I))
      ENDDO
      CALL S3(Z)
```
These subroutines have no side effects, and our program has three independent parts: CALL S1, CALL S3, and the DO loop. We also assume the loop iterations are independent, so the loop itself is concurrent. The execution times are as follows:
Part | Time
---|---
S1 | 25 ms
LOOP | 100 ms
S3 | 35 ms
If we add up the component times, we see that the sequential execution time is 160 ms. The question is: how can we use four nodes to get a smaller execution time? With a straightforward parallelization, we get the workload shown in Figure One.

*Figure One: A straightforward parallelization of the example on four nodes*
Now we get an unimpressive runtime of 100 ms instead of 160 ms, because the whole loop still runs on a single node; that is a speedup of only 1.6 on four nodes. Let's see if we can do better.
Balancing with Data Flow
To find a better load balance we can split the DO loop into several parts and assign a part of the loop to each node. One way to do this is the Flow model. This model is an MPMD program organized in a data-flow master/worker style. The nodes are organized as one master node and several worker nodes. As we know, the program contains sequential and parallel parts. The master performs the sequential parts of the program and hands out jobs to the workers, which execute the parallel parts. The master dynamically splits the parallel parts of the program based on the number of workers. When a worker starts, all data required for its execution are sent from the master to the worker. When the worker is finished, all results are sent back to the master. In our case, the program can be decomposed as shown in Figure Two.

*Figure Two: Decomposition of the program into master and worker jobs*
In the figure, m is the number of parts into which the DO loop is decomposed. An example of the time diagram with four nodes and the loop split into four parts (m = 4) is given in Figure Three. In the figure, three nodes start by executing worker 1, worker 2, and worker 3.1. First, Node 1 finishes worker 1 and starts worker 3.2. Then Node 3 finishes worker 3.1 and starts worker 3.3. Finally, Node 2 finishes worker 2 and starts worker 3.4.

*Figure Three: Time diagram of the Flow model with four nodes and m = 4*
As you can see, the execution time is reduced to 60 ms. This is much better! The following sidebar shows a template for a Flow-type program.
Sidebar: Flow Algorithm Template
```
master
   n_workers = m + 2              ! balance workers to workload
   do k = 1, n_nodes
      transfer data to node k
   enddo
   do k = n_nodes+1, n_workers
      wait for a worker to finish
      receive data from worker w
      transfer data to worker w
   enddo
   do k = 1, n_nodes
      wait for a worker to finish
      receive data from worker
   enddo

worker
   do while (not_finished)
      receive job from master
      if (job .eq. job1) then
         receive x
         call s1(x)
         send x to master
      elseif (job .eq. job2) then
         receive imin, imax, y(imin:imax)
         do i = imin, imax
            call s2(y(i))
         enddo
         send y(imin:imax) to master
      elseif (job .eq. job3) then
         receive z
         call s3(z)
         send z to master
      endif
   enddo
```
As you can see in the template, the master assigns jobs to the first n_nodes workers. Then the master monitors the state of the workers. When a worker finishes, the master assigns another job to that worker. When all the jobs have been assigned, the master simply waits for the remaining results.
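To make this concrete, below is a minimal runnable sketch of Flow-style dispatch in Fortran 90 with MPI. It is not the full template above: it only shows the master handing out chunks of the loop to whichever worker finishes first. The program name, message tags, and the stand-in loop body are assumptions made for the sketch.

```
! A sketch of Flow-style dynamic dispatch: rank 0 (master) owns y and
! hands out loop chunks; workers compute and return them. Assumes at
! least two MPI ranks; the loop body is a stand-in for S2.
program flow_sketch
   use mpi
   implicit none
   integer, parameter :: n = 1000, nchunks = 4, chunk = n / nchunks
   integer :: rank, nprocs, ierr, w, k, nsent, nrecv
   integer :: bounds(2), status(MPI_STATUS_SIZE)
   real :: y(n), buf(chunk)

   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

   if (rank == 0) then                       ! ----- master -----
      y = 1.0
      nsent = 0
      nrecv = 0
      do w = 1, min(nprocs-1, nchunks)       ! prime each worker
         nsent = nsent + 1
         bounds(1) = (nsent-1)*chunk + 1
         bounds(2) = nsent*chunk
         call MPI_SEND(bounds, 2, MPI_INTEGER, w, 1, MPI_COMM_WORLD, ierr)
         call MPI_SEND(y(bounds(1)), chunk, MPI_REAL, w, 2, MPI_COMM_WORLD, ierr)
      enddo
      do while (nrecv < nchunks)             ! collect results, refill
         call MPI_RECV(bounds, 2, MPI_INTEGER, MPI_ANY_SOURCE, 3, &
                       MPI_COMM_WORLD, status, ierr)
         w = status(MPI_SOURCE)
         call MPI_RECV(y(bounds(1)), chunk, MPI_REAL, w, 4, &
                       MPI_COMM_WORLD, status, ierr)
         nrecv = nrecv + 1
         if (nsent < nchunks) then           ! more work: next chunk
            nsent = nsent + 1
            bounds(1) = (nsent-1)*chunk + 1
            bounds(2) = nsent*chunk
            call MPI_SEND(bounds, 2, MPI_INTEGER, w, 1, MPI_COMM_WORLD, ierr)
            call MPI_SEND(y(bounds(1)), chunk, MPI_REAL, w, 2, MPI_COMM_WORLD, ierr)
         else                                ! no more work: stop signal
            bounds = 0
            call MPI_SEND(bounds, 2, MPI_INTEGER, w, 1, MPI_COMM_WORLD, ierr)
         endif
      enddo
      do w = nchunks+1, nprocs-1             ! stop idle workers, if any
         bounds = 0
         call MPI_SEND(bounds, 2, MPI_INTEGER, w, 1, MPI_COMM_WORLD, ierr)
      enddo
   else                                      ! ----- worker -----
      do
         call MPI_RECV(bounds, 2, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, status, ierr)
         if (bounds(1) == 0) exit            ! stop signal from master
         call MPI_RECV(buf, chunk, MPI_REAL, 0, 2, MPI_COMM_WORLD, status, ierr)
         do k = 1, chunk
            buf(k) = 2.0*buf(k)              ! stand-in for CALL S2(Y(I))
         enddo
         call MPI_SEND(bounds, 2, MPI_INTEGER, 0, 3, MPI_COMM_WORLD, ierr)
         call MPI_SEND(buf, chunk, MPI_REAL, 0, 4, MPI_COMM_WORLD, ierr)
      enddo
   endif

   call MPI_FINALIZE(ierr)
end program flow_sketch
```

Note that the master identifies the source of each result with status(MPI_SOURCE); this is what lets a faster worker receive more chunks, which is exactly the behavior shown in Figure Three.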
Be Dynamic
As we have seen, the Flow model keeps data on the master, and each time data are needed they are sent to a worker regardless of what data were there before. When a lot of data must be sent to workers, overhead increases and performance drops. In this case it can be reasonable to use the Dynamic model. This model is very similar to the Flow model, but it is an SPMD program. The difference is that data are kept on all processors so they are ready for use. This requires that, after a parallel part of the code is executed, each node send its new data to all the other nodes. The other main feature of this model is that all processors execute the same code until they need I/O or reach a parallel section. I/O is executed only on the first (master) node. Parallel sections are executed much as in the Flow model. The sidebar shows an example of the dynamic algorithm.
Sidebar: A Dynamic Algorithm
```
      IF (master) THEN
         n_workers = m + 2               ! compute number of parallel parts
         DO i_part = 1, n_workers
            receive data from worker w
            load_worker(w)               ! send to worker next portion
         ENDDO                           ! or end signal
      ELSE
         work = .true.
         DO WHILE (work)
            compute (first time) or receive next part
            GOTO (1, 2, 3, 4), part
    1       receive x from master
            call s1(x)                   ! first part (block)
            send x to master
            GOTO 5
    2       receive imin, imax, y(imin:imax) from master
            DO i = imin, imax
               call s2(y(i))
            ENDDO
            send y(imin:imax) to master
            GOTO 5
    3       receive z from master
            call s3(z)                   ! third part (block)
            send z to master
            GOTO 5
    4       work = .false.               ! end of work
    5       CONTINUE
         ENDDO
      ENDIF
      broadcast(x, y, z)                 ! broadcast from master to all
```
The load balance of the dynamic algorithm is practically the same as with the flow model. Which model is best to use? The answer depends on how much data is used and how much data is produced by the parallel segments.
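As an illustration of the data replication that defines the Dynamic model, here is a minimal Fortran 90 + MPI sketch. For brevity the parallel parts are assigned statically here; the point is that every rank keeps full copies of x, y, and z, and the owners broadcast their results after the parallel section. The program name, initial values, and stand-in computations are assumptions.

```
! Dynamic-model data handling: all ranks hold x, y, z; after the
! parallel parts, results are broadcast so every copy is current.
! Assumes at least two ranks and that nprocs divides n.
program dynamic_sketch
   use mpi
   implicit none
   integer, parameter :: n = 1000
   integer :: rank, nprocs, ierr, i, w, chunk
   real :: x, z, y(n)

   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

   x = 1.0; z = 1.0; y = 1.0        ! identical initial data everywhere

   if (rank == 0) x = x + 1.0       ! stand-in for CALL S1(X)
   if (rank == 1) z = 2.0*z         ! stand-in for CALL S3(Z)

   chunk = n / nprocs               ! each rank's share of the DO loop
   do i = rank*chunk + 1, (rank+1)*chunk
      y(i) = 2.0*y(i)               ! stand-in for CALL S2(Y(I))
   enddo

   ! replicate results: every rank ends up with current x, y, z
   call MPI_BCAST(x, 1, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
   call MPI_BCAST(z, 1, MPI_REAL, 1, MPI_COMM_WORLD, ierr)
   do w = 0, nprocs-1               ! each rank broadcasts its y block
      call MPI_BCAST(y(w*chunk+1), chunk, MPI_REAL, w, MPI_COMM_WORLD, ierr)
   enddo

   call MPI_FINALIZE(ierr)
end program dynamic_sketch
```

A single MPI_ALLGATHER call could replace the loop of broadcasts for y; the loop is shown only to mirror the "each node sends new data to all the other nodes" description above.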
The Static Model
One may notice that the local memory of a node is not used very efficiently in the previous models. When large amounts of data are used, it can be desirable to keep only part of the data on each node to minimize data movement and to support large problems. The Static model is based on array distribution and is probably the most common model used for parallelization. Arrays are split between the computational nodes, and operations on given elements are performed on the specific nodes that hold them. For example, the following code fragment can be executed with the static model.
```
      REAL Y(N)
      ...
      DO I = 1, N
         CALL S2(Y(I))
      ENDDO
```
A static parallelization is shown in the sidebar.
Sidebar: Example Static Parallelization
```
! suppose that n is divisible by n_nodes, and P is this node's number
      real y(n/n_nodes)
      ...
      low  = P*(n/n_nodes) + 1
      high = (P+1)*(n/n_nodes)
      if (current_node .eq. node1) then
         call s1(x)
      else if (current_node .eq. node2) then
         call s3(z)
      endif
      do i = low, high
         call s2(y(i - P*(n/n_nodes)))
      enddo
```
In general, as the number of nodes increases, the amount of memory needed per node decreases, because a smaller portion of the array is placed on each node. Let's look again at our program. With a static parallelization we can assign S1 to Node 1 and S3 to Node 3 as before. We then distribute the array Y across all four nodes. Figure Four shows this distribution.

*Figure Four: Array Y distributed across all four nodes*
We get 60 ms again: Node 3 must run S3 (35 ms) plus its quarter of the loop (25 ms). Can we do better? Another way is to distribute the array over a subset of the nodes, in this case two nodes. Then for the same program we get the diagram in Figure Five.

*Figure Five: Array Y distributed across two nodes, with S1 and S3 on the other two*
Now we get 50 ms for the program: the two loop nodes each execute half the loop (50 ms), while S1 (25 ms) and S3 (35 ms) run on the other two nodes. This result is good, but the price paid for this model is that the balancing of parallel loops is more "rigid" than one might like.
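To make the sidebar's index arithmetic concrete, here is a minimal runnable sketch of the static distribution in Fortran 90 + MPI. The global-to-local index mapping follows the sidebar; the program name and the stand-in loop body are assumptions.

```
! Static model: each rank stores only its block of y (n/nprocs
! elements) and runs the loop over the global indices it owns.
! Assumes nprocs divides n; the loop body is a stand-in for S2.
program static_sketch
   use mpi
   implicit none
   integer, parameter :: n = 1000
   integer :: rank, nprocs, ierr, i, low, high, nloc
   real, allocatable :: y(:)

   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

   nloc = n / nprocs
   allocate(y(nloc))                ! only the local block exists here
   y = 1.0

   low  = rank*nloc + 1             ! global index range owned by rank
   high = (rank+1)*nloc
   do i = low, high
      y(i - rank*nloc) = 2.0*y(i - rank*nloc)  ! global i -> local index
   enddo

   deallocate(y)
   call MPI_FINALIZE(ierr)
end program static_sketch
```

Because each rank allocates only n/nprocs elements, the total problem size can grow with the number of nodes, which is the main advantage the text ascribes to this model.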
A Recap
At this point a summary may help to clarify the differences between the models. In the Flow model, the program executes as MPMD and there is no data partitioning. The master node acts as a distributor and collector of data: workers are sent data, process it, and send results back to the master. The scheduling is dynamic, which means it can accommodate variations in run times and still maintain good efficiency (run-time decisions can be made as to whether a certain loop should be executed in parallel). The size of the program and its data is limited by the memory of the master node.

In the Dynamic model, the program executes as SPMD and, as in the Flow model, there is no partitioning. Each node, including the master, runs the same program. All nodes are brought up to date at the completion of each parallel part of the program. The scheduling is dynamic, and the size of the problem is limited by the memory of a single node, because the entire program and data must fit on each node.
The Static model is SPMD as well; however, the data are partitioned across the nodes. The scheduling is static, and thus efficiency could be lower if loop boundaries change from run to run (the program is committed to parallel execution at certain points). The size of the problem can be very large, since the data are sliced and placed on different nodes.
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.
Pavel Telegin is currently Department Head of Programming Technology and Methods in the Joint Supercomputer Center, Moscow, Russia.