More parallel secrets; speedup, efficiency and your code
In the previous column, we found that parallel programs for clusters have very subtle differences and their efficiency requires careful examination of the code. In this article, we will see what a typical parallel program looks like and how it is executed on a cluster. Be warned, however, there is a bit of gentle mathematics in this column. It will not hurt, we promise.
For simplicity, let us consider that each cluster node contains one processor and each node runs one program. Parallel computation occurs because the programs are communicating with other nodes. Using nodes with two and more processors on each node will work as well as long as the number of programs matches the number of processors. If we ignore memory contention and possible local communication optimizations, then we assume each programs occupies its own memory and we can think about it as running on a separate node with local memory. Therefore, a node with several processors can be considered to contain several virtual uniprocessor nodes. The possibility of programming multiprocessor nodes using threads will be discussed in future columns.
To estimate how much faster parallel program is than sequential program we typically use speedup and efficiency. See Sidebar One for the definition of these terms as we will be using them quite often in our discussions about parallel programming.
Choose Wisely
There are two basic ways to write a new program for clusters. The first way is when you have an existing sequential program and want to make it parallel. The second way is to write a new parallel program. Both ways are widely used when programming clusters. We will not discuss which way is better as it all depends on how much work is required. Generally speaking, if you write a new parallel program, you can make it more efficient. Of course, a parallel algorithm should be used. To understand the algorithm it is necessary to look at the number of operations as function of problem size, i.e. algorithmic complexity. For example, let us look a matrix multiplication.
Sidebar One: Speedup and Efficiency |
The terms speedup and efficiency are often used when talking about clusters. The following definitions will be used in this column. speedup = sequential run-time/parallel run-time program efficiency = speedup/number of processors Speedup shows how much faster a parallel version of the program is compared to a sequential version. Efficiency shows how well the program scales on a cluster. |
void matmul(){ ... for(i=0; i < n; ++i) for(j=0; j < n; ++j){ c[i][j] = 0; for(k=0; k < n; ++k) c[i][j] += a[i][k] * b[k][j]; } }
It is quite clear that the number of operations depends on the number of iterations in all three loops. As loops are nested, the number of operations grows as the cube of the problem size. For the above code, we can write:
(1) | T_{seq} = Cn^{3} = O(n^{3}) |
Where T_{seq} is the sequential execution time and C is the time for the computational component. We estimate that T_{seq} is "on the order of" n cubed and just use O to represent this approximation. Let's investigate this simple multiplication algorithm when the outer loop is split among N processors.
/* P is current process number */ n1 = (n/N)*P; n2 = (n/N)*(P+1); for(i=n1; i < n2; ++i) for(j=0; j < n; ++j){ c[i][j] = 0; for(k=0; k < n; ++k) c[i][j] += a[i][k]*b[k][j]; }
The time T_{par} for the parallel operations is
(2) | T_{par} = O( | n^{3} N |
) |
where N is the number of processors. In practice, this kind of parallelization will require most of the data to reside on all computation nodes. Each node will require part of the array a[n1: n2-1][] and the whole array b. In the worst case, the time T_{org} to provision or organize all nodes with data is
(3) | T_{org} = O(n^{2}log N) |
Where O(n^{2}) is the amount of data that needs to be transferred. (array b is I
(4) | T_{total} = T_{par} + T_{org} |
How Many Processors?
An interesting and important question to investigate is the behavior of the algorithm when the number of processors grows. When N is equal to n we get
(5) | T_{par} + T_{org} = O(n^{2}) + O(n^{2}log n) = O(n^{2}log n) |
So one can see that with a large N, number of processors, the number of operations for data handling will exceed the number of operations for calculations! If we calculate speedup and efficiency we get:
(6) | speedup = O( | n log n |
) |
(7) | efficiency = O( | 1 log n |
) |
Unfortunately, only a few algorithms have better theoretical speedup when data handling is involved. If you are interested in efficient parallel algorithms a good book to consult is The Design and Analysis of Parallel Algorithms by Selim G. Akl Prentice-Hall, Inc., 1989, ISBN: 0132000563)
If you chose to parallelize an existing sequential program, you may save time because much of the program is already written. Another benefit is that because the sequential program works (it should be getting the right answer!), you can compare the results of your parallel version to make sure it is correct. On the other hand, the sequential algorithm used by the program may make parallelization more difficult. For example, consider an alternative matrix multiplication algorithm.
for(i=0; i < n; ++i) i1 = n*i; for(j=0; j < n; ++j){ i2 = n*i; i3 = j; x = 0.; for(k=0; k < n; ++k){ x += a[i2++]*b[i3+=k]; } ++i1; c[i1] = x; }
This code is an implementation of the same matrix multiplication using one dimensional arrays or pointers. In many cases, this sequential code will work a bit faster than other algorithms, but, looking at this program it may be not evident how hard or easy it is to create an efficient parallel program. The data transfers are not as clear as before because a sophisticated index calculation will be needed.