Parallel Programming: A Balancing Act

Article Index

Static or Dynamic? It is a matter of balance

Now that we know how to identify parallel parts of our program, the question is now what to do with this knowledge. Or, how do you write a parallel program. To answer this question, we will discuss what the structure of a parallel program may look like. Programs can be organized in different ways. We already discussed SPMD (Single Program Multiple Data) and MPMD (Multiple Programs Multiple Data) models. SPMD and MPMD represents the way a program looks from the point of view of the cluster. Note, that using a MPMD model with MPI an "app" or "procgroup" file will be needed to start different programs on cluster nodes. Let's see what the programs look from the implementation standpoint.

There are a variety ways to structure parallel codes. In this article we explore three common models that have been successfully used in tools like BERT 77, a Fortran conversion tool. As we know,a parallel program may contain different concurrent parts. When making these parts parallel one may find that the processor load may vary quite a bit. We will ignore data transfers, and only look at the efficiency of the parallel code, i.e. load balance.

An Example

{mosgoogle right} Let's explore an example:

 CALL S1(X)
 DO I = 1, N
    CALL S2(Y(I))
 ENDDO
 CALL S3(Z)

These subroutines have no side effects, and our program has three independent parts: CALL S1, CALL S3 and the DO LOOP. We also assume the loop is concurrent. The execution times are as follows:

Of course, it seems easy to re-write this code and just place different parts of the program and the loop on different nodes. If this part of code was the whole program, and execution times of each part were consistent, this would be the best way. In practice, these are average times and rewriting the code this way would result in awkward program structures. Debugging would be difficult, even with four nodes.

If we add up the component times we see that the sequential execution time is 160 ms. The question is: How to use four nodes to get smaller execution time? When we make straight forward parallelizations, we get the workload represented in figure one.

Image
Figure One: Processor Load for Naive Parallelization

Now we get an unimpressive runtime of 100 ms instead of 160 ms. Let's see if we can do better.

Balancing with Data Flow

To find a better load balance we can split the DO loop into several parts and add part of the loop to each node. One of ways to do this is the Flow model. This model is a MPMD where the program is organized in a data flow master/worker style. Nodes are organized with one master node and several worker nodes. As we know, the program contains sequential and parallel parts. The master performs the sequential part of program and gives jobs to workers which work on parallel part of program.

The master dynamically splits parallel parts of the program based on the number of workers. When a worker starts, all data required for its execution are sent from the master to the worker. When the worker is finished, all results are sent back to the master. In our case, the program can be decomposed as shown in Figure Two.

Figure Two: Program Decomposition for a Flow model
Figure Two: Program Decomposition for a Flow model

In the figure, m is number of parts into which the DO loop is decomposed. An example of the time diagram with four nodes and loop split in four parts (m = 4) is given in Figure Three. In the figure, three nodes start to execute worker1, worker 2, and worker3.1. The first, Node 1 finishes worker1 and starts worker 3.2. Then Node 3 finishes worker 3.1 and starts worker 3.3. Finally, Node 2 finishes worker 2 and starts worker 3.4.

Figure Three: Balancing between nodes
Figure Three: Balancing between nodes

As you can see, the execution time is reduced to 60 ms. This is much better! The foloowing sidebar shows a template for a flow type programs.

PartTime
S125 ms
LOOP100 ms
S335 ms
Sidebar: Flow Algorithm Template

master

  n_workers = m+2
  balance workers to workload
  do k = 1, n_nodes
    transfer data to node k
  enddo
  do k = n_nodes+1, n_workers
    wait for a worker to finish
    receive data from worker w
    transfer data to worker w
  enddo
  do k = 1, n_nodes
    wait for finishing worker
    receive data from worker
  enddo

worker

  do while(not_finished)
  receive job from master
  if(job .eq. job1) then
     receive x
     call s1(x)
     send x to master
  elseif (job .eq. job2) then
   receive imin, imax, y(imin: imax)
   do i =imin, imax
   call s2(y(i))
   enddo
   send y(imin: imax)
  elseif (job .eq. job3) then
    receive z
    call s3(z)
   send z to master
  endif

As you can see in the program, master assigns jobs to first n_nodes workers. Then the master monitors the state of the workers. When a worker finishes, the master assigns another job to a worker. When all the jobs are assigned, the master simply waits for the remaining data.

    Search

    Feedburner

    Login Form

    Share The Bananas


    Creative Commons License
    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.