Where the Cluster Meets the Code

Model. What Model?

In a cluster environment, there are two basic programming models that work well for parallel programs. The first is called Single Program Multiple Data (SPMD), where the program code is the same on all computational nodes and only the data may differ. The second, more complicated scheme is called Multiple Program Multiple Data (MPMD), where the program code may differ on each computational node. Note that, in this case, MPI invocation of the program requires a special file describing the different parallel programs and their placement. In either case, execution of the program on the cluster is similar. As was noted in a previous article, due to the nature of cluster hardware, the main feature of most cluster programs is message passing.
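As a concrete illustration of an MPMD launch (the exact syntax varies by MPI implementation; the colon-separated form below follows the standard mpiexec convention, and the program names master and worker are hypothetical):

```
# Launch one copy of ./master and three copies of ./worker (MPMD)
mpiexec -n 1 ./master : -n 3 ./worker
```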

In every case, the programs running on each node are really different programs because they have their own sets of data. For example, when we split or replicate a for loop, each program has its own loop parameter, and when the loop is split, each copy of the program works on a unique range of that parameter. This can be seen by running the program in Sidebar Two on anywhere from one to sixty-four processors. For information on running MPI programs, consult the MPI Mechanic column.

Sidebar Two: Slicing an Array

The following C example program will slice an example array index, n, by the number of MPI processes that are started by mpirun. In a real program each node would work on the array from index low to index high.

#include <stdio.h>
#include "mpi.h"
int main(int argc, char **argv){
  int P, N;  /* P: this process's rank, N: total number of processes */
  int i, low, high, slice, n = 64;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &P);
  MPI_Comm_size(MPI_COMM_WORLD, &N);
  slice = n/N;
  low = P*slice;
  high = (P+1)*slice;
  if(P == N-1) high = n;
  for(i = low; i < high; ++i)
    printf("  Process %d says: my index is %d\n", P, i);
  printf("Process %d says: My final i = %d\n", P, i);
  MPI_Finalize();
  return 0;
}

Who Wants My Data?

It is important that the data used by the program be placed correctly on each node. Initial data comes from static initializations and input statements. Static initializations are easily replicated on all the nodes that are used; input statements are more difficult. In many implementations, only one node reads the input data, and then all or part of the data is sent to all or part of the other nodes. This is typically done by broadcast routines that minimize the amount of data that needs to be sent. The implementation of an efficient broadcast depends on the computer architecture. Clusters typically consist of nodes connected by a switch or switches that provide complete node-to-node connections. In this environment, a broadcast works the following way (see the Selim G. Akl reference above):

  • Step 1.
    • Node 1 sends data to Node 2
    • Data is now on nodes 1 to 2
  • Step 2.
    • Node 1 sends data to Node 3
    • Node 2 sends data to Node 4
    • Data is now on nodes 1 to 4
  • Step n.
    • Data is on nodes 1 to 2^(n-1)
    • Node k sends data to Node 2^(n-1) + k, where 1 <= k <= 2^(n-1)
    • Data is now on nodes 1 to 2^n

It is easy to see that the number of steps to broadcast to N nodes is log2(N), rounded up. This estimate is very important, as it is one of the key terms when estimating the running time of parallel algorithms. Data are computed on the nodes, and very often some or all of these data are needed on another node, so it may be necessary to send or broadcast the data to the other nodes.

One way to reduce broadcasts is to replicate some of the computed data, if possible. If a processor would otherwise sit waiting for broadcast data, it may be more efficient for it to calculate the same data locally and skip the broadcast altogether. Let's recall the matrix multiplication program and consider one type of data initialization.

for(i=0; i < n; ++i)
  for(j=0; j < n; ++j){
    a[i][j] = (float)(i+j);
    b[i][j] = (float)(i-j);
  }
matmul();

It is possible to apply the same kind of parallelization we used for matmul to the initialization of the data.

n1 = (n/N)*P;    // P is this process's number, N is the process count
n2 = (n/N)*(P+1);
for(i=n1; i < n2; ++i)
  for(j=0; j < n; ++j){
    a[i][j] = (float)(i+j);
    b[i][j] = (float)(i-j);
  }
matmul();

But then we would need to provide the entire array b to all nodes, which causes substantial communication overhead. Instead, we can do the following.

n1 = (n/N)*P;    // P is this process's number, N is the process count
n2 = (n/N)*(P+1);
for(i=n1; i < n2; ++i)
  for(j=0; j < n; ++j)
    a[i][j] = (float)(i+j);
for(i=0; i < n; ++i)
  for(j=0; j < n; ++j)
    b[i][j] = (float)(i-j);
matmul();

Here the calculation of b is replicated: every node computes all of b, so each has its own complete copy. Notice also that write operations are usually performed from only one node, which requires moving data as well. Looking at this algorithm, we see that only parts of many arrays are used by each parallel program. In the multiplication example, only one array, b, is used in full by all programs; the other arrays, a and c, are each used only partially. We can consider the data of these arrays to be partitioned across the nodes. This method is often referred to as array distribution or array layout.

Perhaps one of the biggest benefits of this method is that it is not necessary to allocate all the memory for all the arrays on every node. A node can allocate only the small portion of each distributed array that it actually uses. This allows large programs that would never fit into the memory of a single computer to be executed, a result reflected in some cluster designs where a larger number of nodes reduces the amount of memory needed per node.

Now that we have an understanding of the principles involved in running a program in parallel, we will examine some basic parallelization techniques in the next column.

This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.

Pavel Telegin is currently Department Head of Programming Technology and Methods in the Joint Supercomputer Center, Moscow, Russia.

    ©2005-2012 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.