Static or Dynamic? It is a matter of balance
Now that we know how to identify the parallel parts of a program, the question is what to do with this knowledge. In other words, how do you write a parallel program? To answer this question, we will discuss what the structure of a parallel program may look like. Programs can be organized in different ways. We have already discussed the SPMD (Single Program Multiple Data) and MPMD (Multiple Programs Multiple Data) models. SPMD and MPMD represent the way a program looks from the point of view of the cluster. Note that to use an MPMD model with MPI, an "app" or "procgroup" file is needed to start different programs on the cluster nodes. Let's see what these programs look like from the implementation standpoint.
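For example, with the MPICH ch_p4 device an MPMD job can be described by a procgroup file. A sketch (the hostnames and paths here are invented for illustration) might look like:

```
local 0
node1 1 /home/user/flow/worker
node2 1 /home/user/flow/worker
node3 1 /home/user/flow/worker
```

Each line names a host, the number of processes to start there, and the program to run; the first line is the local master process started by mpirun itself. Launchers that follow the MPI standard's mpiexec syntax can express the same thing on the command line, e.g. mpiexec -n 1 master : -n 3 worker.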
There are a variety of ways to structure parallel codes. In this article we explore three common models that have been used successfully in tools like BERT 77, a Fortran conversion tool. As we know, a parallel program may contain several concurrent parts. When making these parts parallel, one may find that the processor load varies quite a bit. We will ignore data transfers and look only at the efficiency of the parallel code, i.e. its load balance.
An Example
Let's explore an example:

```fortran
      CALL S1(X)
      DO I = 1, N
         CALL S2(Y(I))
      ENDDO
      CALL S3(Z)
```
These subroutines have no side effects, so our program has three independent parts: CALL S1, CALL S3, and the DO loop. We also assume the loop is concurrent, i.e. its iterations are independent of one another. The execution times are as follows:
| Part | Time   |
|------|--------|
| S1   | 25 ms  |
| LOOP | 100 ms |
| S3   | 35 ms  |
If we add up the component times, we see that the sequential execution time is 160 ms. The question is: how can we use four nodes to get a smaller execution time? If we simply run each independent part on its own node, we get the workload shown in Figure One.
Now we get an unimpressive runtime of 100 ms instead of 160 ms: the whole DO loop runs on a single node, so its 100 ms dominates while the other nodes sit idle. Let's see if we can do better.
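A minimal sketch of this straightforward decomposition in Fortran 77 with MPI might look like the following. It assumes the S1, S2, and S3 routines from the example are linked in, picks an arbitrary N, and omits data distribution and collection:

```fortran
      PROGRAM STATIC
      INCLUDE 'mpif.h'
      INTEGER RANK, IERR, I, N
C     N is an arbitrary size chosen for this sketch.
      PARAMETER (N = 1000)
      REAL X, Y(N), Z

      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)

C     Each independent part runs on its own node; node 3 stays idle.
      IF (RANK .EQ. 0) THEN
         CALL S1(X)
      ELSEIF (RANK .EQ. 1) THEN
         DO I = 1, N
            CALL S2(Y(I))
         ENDDO
      ELSEIF (RANK .EQ. 2) THEN
         CALL S3(Z)
      ENDIF

C     The runtime is set by the slowest part: the 100 ms loop.
      CALL MPI_FINALIZE(IERR)
      END
```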
Balancing with Data Flow
To find a better load balance, we can split the DO loop into several parts and give a part of the loop to each node. One way to do this is the Flow model. This is an MPMD model in which the program is organized in a data-flow master/worker style: the nodes are organized as one master node and several worker nodes. As we know, the program contains sequential and parallel parts. The master performs the sequential parts of the program and hands jobs to the workers, which execute the parallel parts. The master dynamically splits the parallel parts of the program based on the number of workers. When a worker starts, all data required for its execution are sent from the master to the worker. When the worker finishes, all results are sent back to the master. In our case, the program can be decomposed as shown in Figure Two.
In the figure, m is the number of parts into which the DO loop is decomposed. An example time diagram with four nodes and the loop split into four parts (m = 4) is given in Figure Three. In the figure, the three worker nodes start by executing worker 1, worker 2, and worker 3.1. First, Node 1 finishes worker 1 and starts worker 3.2. Then Node 3 finishes worker 3.1 and starts worker 3.3. Finally, Node 2 finishes worker 2 and starts worker 3.4.
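Assuming the four loop chunks take equal time, each worker 3.x costs 100/4 = 25 ms, and the per-node totals in Figure Three work out as:

```
Node 1: worker 1   (25 ms) + worker 3.2 (25 ms) = 50 ms
Node 2: worker 2   (35 ms) + worker 3.4 (25 ms) = 60 ms
Node 3: worker 3.1 (25 ms) + worker 3.3 (25 ms) = 50 ms
```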
As you can see, the execution time is reduced to 60 ms, set by Node 2. This is much better! The following sidebar shows a template for a Flow-type program.
Sidebar: Flow Algorithm Template

```
master:
   n_workers = m + 2          ! balance workers to workload
   do k = 1, n_nodes
      transfer data to node k
   enddo
   do k = n_nodes+1, n_workers
      wait for a worker to finish
      receive data from worker w
      transfer data to worker w
   enddo
   do k = 1, n_nodes
      wait for a worker to finish
      receive data from worker
   enddo

worker:
   do while (not_finished)
      receive job from master
      if (job .eq. job1) then
         receive x
         call s1(x)
         send x to master
      elseif (job .eq. job2) then
         receive imin, imax, y(imin:imax)
         do i = imin, imax
            call s2(y(i))
         enddo
         send y(imin:imax) to master
      elseif (job .eq. job3) then
         receive z
         call s3(z)
         send z to master
      endif
   enddo
```
As you can see in the template, the master assigns jobs to the first n_nodes workers. The master then monitors the state of the workers: when a worker finishes, the master assigns it another job. When all the jobs have been assigned, the master simply waits for the remaining results.
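As a rough illustration, here is how the worker loop of the template might map onto MPI in Fortran 77. This is a minimal sketch, not BERT 77 output: the job codes, message tags, termination signal, and buffer size are our own assumptions.

```fortran
C     Sketch of the Flow worker loop in Fortran 77 + MPI.
C     Job codes (0 = stop, 1..3 = s1/loop/s3), tags, and MAXN
C     are assumptions made for this illustration.
      SUBROUTINE WORKER
      INCLUDE 'mpif.h'
      INTEGER MAXN
      PARAMETER (MAXN = 1000)
      INTEGER JOB, BOUNDS(2), N, I, IERR
      INTEGER STAT(MPI_STATUS_SIZE)
      REAL X, Y(MAXN), Z

 10   CONTINUE
C     Tag 0 carries the job code; job 0 means "no more work".
      CALL MPI_RECV(JOB, 1, MPI_INTEGER, 0, 0,
     &              MPI_COMM_WORLD, STAT, IERR)
      IF (JOB .EQ. 0) RETURN

      IF (JOB .EQ. 1) THEN
         CALL MPI_RECV(X, 1, MPI_REAL, 0, 1,
     &                 MPI_COMM_WORLD, STAT, IERR)
         CALL S1(X)
         CALL MPI_SEND(X, 1, MPI_REAL, 0, 2,
     &                 MPI_COMM_WORLD, IERR)
      ELSEIF (JOB .EQ. 2) THEN
C        The loop bounds arrive first, then the matching slice of Y.
         CALL MPI_RECV(BOUNDS, 2, MPI_INTEGER, 0, 1,
     &                 MPI_COMM_WORLD, STAT, IERR)
         N = BOUNDS(2) - BOUNDS(1) + 1
         CALL MPI_RECV(Y(BOUNDS(1)), N, MPI_REAL, 0, 1,
     &                 MPI_COMM_WORLD, STAT, IERR)
         DO I = BOUNDS(1), BOUNDS(2)
            CALL S2(Y(I))
         ENDDO
         CALL MPI_SEND(Y(BOUNDS(1)), N, MPI_REAL, 0, 2,
     &                 MPI_COMM_WORLD, IERR)
      ELSEIF (JOB .EQ. 3) THEN
         CALL MPI_RECV(Z, 1, MPI_REAL, 0, 1,
     &                 MPI_COMM_WORLD, STAT, IERR)
         CALL S3(Z)
         CALL MPI_SEND(Z, 1, MPI_REAL, 0, 2,
     &                 MPI_COMM_WORLD, IERR)
      ENDIF
      GOTO 10
      END
```

The master side mirrors the template: it sends a job code followed by the data for that job, and a final job code of 0 tells each worker to stop.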