More parallel secrets; speedup, efficiency and your code
In the previous column, we found that parallel programs for clusters have very subtle differences and their efficiency requires careful examination of the code. In this article, we will see what a typical parallel program looks like and how it is executed on a cluster. Be warned, however, there is a bit of gentle mathematics in this column. It will not hurt, we promise.
For simplicity, let us consider that each cluster node contains one processor and each node runs one program. Parallel computation occurs because the programs are communicating with other nodes. Using nodes with two and more processors on each node will work as well as long as the number of programs matches the number of processors. If we ignore memory contention and possible local communication optimizations, then we assume each programs occupies its own memory and we can think about it as running on a separate node with local memory. Therefore, a node with several processors can be considered to contain several virtual uniprocessor nodes. The possibility of programming multiprocessor nodes using threads will be discussed in future columns.
To estimate how much faster parallel program is than sequential program we typically use speedup and efficiency. See Sidebar One for the definition of these terms as we will be using them quite often in our discussions about parallel programming.
Choose Wisely
There are two basic ways to write a new program for clusters. The first way is when you have an existing sequential program and want to make it parallel. The second way is to write a new parallel program. Both ways are widely used when programming clusters. We will not discuss which way is better as it all depends on how much work is required. Generally speaking, if you write a new parallel program, you can make it more efficient. Of course, a parallel algorithm should be used. To understand the algorithm it is necessary to look at the number of operations as function of problem size, i.e. algorithmic complexity. For example, let us look a matrix multiplication.
Sidebar One: Speedup and Efficiency 
The terms speedup and efficiency are often used when talking about clusters. The following definitions will be used in this column. speedup = sequential runtime/parallel runtime program efficiency = speedup/number of processors Speedup shows how much faster a parallel version of the program is compared to a sequential version. Efficiency shows how well the program scales on a cluster. 
void matmul(){ ... for(i=0; i < n; ++i) for(j=0; j < n; ++j){ c[i][j] = 0; for(k=0; k < n; ++k) c[i][j] += a[i][k] * b[k][j]; } }
It is quite clear that the number of operations depends on the number of iterations in all three loops. As loops are nested, the number of operations grows as the cube of the problem size. For the above code, we can write:
(1)  T_{seq} = Cn^{3} = O(n^{3}) 
Where T_{seq} is the sequential execution time and C is the time for the computational component. We estimate that T_{seq} is "on the order of" n cubed and just use O to represent this approximation. Let's investigate this simple multiplication algorithm when the outer loop is split among N processors. {mosgoogle right}
/* P is current process number */ n1 = (n/N)*P; n2 = (n/N)*(P+1); for(i=n1; i < n2; ++i) for(j=0; j < n; ++j){ c[i][j] = 0; for(k=0; k < n; ++k) c[i][j] += a[i][k]*b[k][j]; }
The time T_{par} for the parallel operations is
(2)  T_{par} = O(  n^{3} N 
) 
where N is the number of processors. In practice, this kind of parallelization will require most of the data to reside on all computation nodes. Each node will require part of the array a[n1: n21][] and the whole array b. In the worst case, the time T_{org} to provision or organize all nodes with data is
(3)  T_{org} = O(n^{2}log N) 
Where O(n^{2}) is the amount of data that needs to be transferred. (array b is I
(4)  T_{total} = T_{par} + T_{org} 
How Many Processors?
An interesting and important question to investigate is the behavior of the algorithm when the number of processors grows. When N is equal to n we get
(5)  T_{par} + T_{org} = O(n^{2}) + O(n^{2}log n) = O(n^{2}log n) 
So one can see that with a large N, number of processors, the number of operations for data handling will exceed the number of operations for calculations! If we calculate speedup and efficiency we get:
(6)  speedup = O(  n log n 
) 
(7)  efficiency = O(  1 log n 
) 
Unfortunately, only a few algorithms have better theoretical speedup when data handling is involved. If you are interested in efficient parallel algorithms a good book to consult is The Design and Analysis of Parallel Algorithms by Selim G. Akl PrenticeHall, Inc., 1989, ISBN: 0132000563)
If you chose to parallelize an existing sequential program, you may save time because much of the program is already written. Another benefit is that because the sequential program works (it should be getting the right answer!), you can compare the results of your parallel version to make sure it is correct. On the other hand, the sequential algorithm used by the program may make parallelization more difficult. For example, consider an alternative matrix multiplication algorithm.
for(i=0; i < n; ++i) i1 = n*i; for(j=0; j < n; ++j){ i2 = n*i; i3 = j; x = 0.; for(k=0; k < n; ++k){ x += a[i2++]*b[i3+=k]; } ++i1; c[i1] = x; }
This code is an implementation of the same matrix multiplication using one dimensional arrays or pointers. In many cases, this sequential code will work a bit faster than other algorithms, but, looking at this program it may be not evident how hard or easy it is to create an efficient parallel program. The data transfers are not as clear as before because a sophisticated index calculation will be needed.
Model. What Model?
In a cluster environment, there are two basic programming models that work well for parallel programs. The first is called Single Program Multiple Data (SPMD) where the program code is the same on all the computational nodes. Only the data may differ. The second scheme is more complicated and is called Multiple Program Multiple Data. Program code may differ on each computational node. Note that, in this case, MPI invocation of the program will require a special file with a description of the different parallel programs and their placement. In either case, the execution of the program on the cluster is similar. As it was noted in previous article, due to the nature of the cluster hardware, the main feature of most cluster programs is message passing.In every case, the programs running on each node are really different programs because they have their own sets of data. For example, when we split or replicate a for loop, each program has its own loop parameter. And when we split the loop one can find that each copy for the program uses a unique loop parameter. This can be seen when running the program in sidebar 2 on a range of one to sixty four processors. For information on running MPI programs consult the MPI Mechanic column.
Sidebar Two: Slicing an Array 
The following C example program will slice an example array index, n, by the number of MPI processes that are started by mpirun. In a real program each node would work on the array from index low to index high. #include

Who Wants My Data?
It is important that data used by the program be placed correctly on each node. Initial data comes from static initializations and input statements. Static initializations are easily placed on all nodes that are used. As for input statements, it is more difficult. In many implementations only one node reads input data and then all or part of the data is sent to all or part of the other nodes. This is typically done by broadcast routines that minimize the amount of data needed to be sent. Implementation of a reliable broadcast depends on the computer architecture. Clusters typically consists of nodes connected by a switch or switches and provide complete nodetonode connections. In this environment, a reliable broadcast works the following way (See the Selim G. Akhl reference above):
 Step 1.
 Node 1 sends data to Node2
 Data is now on nodes 1 to 2
 Step 2.
 Node 1 sends data to Node 3
 Node 2 sends data to Node 4
 Data is now on nodes 1 to 4
 Step n
 Data is on nodes 1 to 2n1
 Node k sends data to Node 2n1+k
 where 1<= k <= 2n1
 Data is on nodes 1 to 2n1
It is easy to see that number of steps for broadcast for N nodes is log(N) (rounded up). This estimation is very important, as it is one of key issues in estimation of the time for parallel algorithms. Data, are computed on nodes and very often some or all of this data are needed on another node. It may be necessary to send or broadcast this data to the other nodes.
A way to reduce broadcasts it to replicate some of the computed data, if possible. If a processor is waiting for broadcast data, then it may be more efficient to calculate the same data and not do the broadcast at all. Let's recall the matrix multiplication program and consider some type of initialization of data. {mosgoogle right}
for(i=0; i < n; ++i){ a[i][j] = (float)(i+j); b[i][j] = (float)(ij); } matmul();
It is possible to do the same kind of parallelization we did for matmul for the initialization for data.
n1 = (n/N)*P; // P is current process number n2 = (n/N)*(P+1); for(i=n1; i < n2; ++i){ a[i][j] = (float)(i+j); b[i][j] = (float)(ij); } matmul();
But after that we will need to provide the entire array b to all nodes. This will cause substantial overhead. Instead of this we can do the following.
n1 = (n/N)*P; // P is current process number n2 = (n/N)*(P+1); for(i=n1; i < n2; ++i) a[i][j] = (float)(i+j); for(i=0; i < n; ++i) b[i][j] = (float)(ij); matmul();
Here calculations of b are replicated. Also notice that write operations are usually made from only one node. This will require provisioning of data as well. When data are replicated on all or several computational nodes each has its own copy. When processing this algorithm we see that only parts of many arrays are used by the parallel programs. In particular, only one of the arrays, b, is used by all programs in the multiplication example. Other arrays, a, and c, are used only partially. We can consider the data of these arrays to be partitioned across the nodes. This method is often referred to as array distribution or array layout.
Perhaps one of the biggest benefits of this method is that it is not necessary to allocate all the memory for all the arrays. It is possible for a node to allocate only a small amount of memory used by certain arrays. The benefit of this method is that it allows for large programs that would never fit into the memory of a single computer to be executed. This result is often reflected in some cluster designs where a larger number of nodes reduce the amount of memory need per node.
Now that we have an understanding of the principles involved in running a program in parallel, we will examine some basic parallelization techniques in the next column.
This article was originally published in ClusterWorld Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux, you may wish to visit Linux Magazine.
Pavel Telegin is currently Department Head of Programming Technology and Methods in the Joint Supercomputer Center, Moscow, Russia.