Hits: 13099

Dynamic Parallel Execution: Losing control at the high end

In a past column, I talked about programming large numbers of cluster nodes. By large, I mean somewhere around 10,000. If you have been following along, at the end of the article I had promised to mention some real alternatives to MPI and even suggest some wild ideas. I plan to keep my promise, however, I wanted to take a slight detour this month and develop the solution a bit further. One point of note before we begin. To keep things simple, I will refer to cluster nodes as if they were a single processing unit.

As you may recall, my conclusion was based on the notion that dependence on large numbers of things increase the chance that one of them will fail. I then proposed that it would be cheaper to develop software that can live with failure than try to engineer hardware redundancy. Finally, I concluded that adapting to failure will require dynamic software. As opposed to statically scheduled programs, dynamic software will adapt at run-time. I should also mention that my motivation is to make programming clusters easier by moving closer to my problem and further away from the minutia of message passing.

A Dynamic Detour

If you think about dynamic code for while all kinds of issues arise. For instance, if I were to create a master application to monitor and keep track of the status of all the nodes (e.g. busy or free) such an application would in one sense be static and represent a single point of control. But wait a minute, if we are creating applications that adapt to failure, then how can we create a master application that must run/interact across the whole cluster. Indeed, on a 10,000 node cluster, how will you communicate real-time node state in a timely fashion and not create congestion on the interconnect. Remember, if applications are running in a dynamic fashion, then their location (resource use) within the cluster can change as the application runs. My guess is the overhead required to track/request this kind of information and make it available to a single master process (i.e. a batch scheduler) will become a limiting factor. Astute readers may recall, that one of the criteria for a "Grid" is that there be no single point of control. Such will be our goal with clusters as well.

How Loops Become Trees

Let's develop our problem a bit more. We will assume the cluster environment is collection of independent computing nodes with a switched interconnect. The interconnect allows any computing node to talk to any other node. We are not going to worry about throughput and latency because in a dynamic environment, they will not effect our discussion. In addition to running programs, the nodes will also have a predefined criteria to communicate their state when asked. To keep things simple, we will allow two states; A busy state that means the nodes cannot help with the current application, and a free state that they will accept work for the current application.

As stated above, we also have no central monitoring hardware. Let's further assume that we can start a program on any node, and the program can ask for assistance from any other node. The response it can expect is either busy of free. Starting from our first node, our dynamic program may employ hundreds of sub-nodes, each of which could, in theory, employ other nodes (sub-sub-nodes) and so on. Our program at run time is beginning to look like an inverted tree. It may help to visualize each branch point as and loop statement. The outer loop will get spawned first as a series of sub-processes, then the inner loop would get spawned as a sub-sub-process.

Slow or Dead Who Cares

Here is where it gets interesting. Ask the question, What is the difference between a branch of the program (e.g. a sub-sub-process) that takes a really long time to complete and a node that fails? I will venture to say that from the perspective of the node that spawned the branch, there is no difference. If after a master node dispatches all possible branches and the node finds that it is "waiting" on a particular branch, then it may choose to seek more resources or use itself (now that it is only waiting). In this case it would seem logical to redo the branch. Oh, the horror of duplicated work, you say. Well, remember, the work only gets duplicated if there are free resources. If for instance, the application has saturated the cluster or (portion of the cluster), then there are no free resources to duplicate the work.

This type of approach will also work in heterogeneous environments where "slow" can be defined as everything from a dead node, to a slow node, to a flaky interconnect. A master node, living in isolation, does not know or care about other nodes. All it knows is that a branch (or branches) it sent to another node is not completing and that there are other resources that may be able to help. Now assuming that at some point, the cluster is saturated, then the "master nodes" will have to become workers and do the work that was given to them by their masters. There are many variations that can be worked into this type of "ignorance is bliss" approach, but the general idea is for each node to get their work done as fast as possible.

Wait a minute, you say, there are sure to be many cases that would just load up the cluster with redundant work. A good question, which means we need to address another more thorny issue -- choosing what nodes to use.

Roll the Dice

At this point, my proposed methodology is staring to sound a bit off-the-wall. As a programmer, I suspect your tendency will be to start adding some kind of control onto this approach. Stay with me here, I invite you to let go of that programmer voice that tells you need to control everything. Take a breath. Good, let's move on.

We are now facing the question, I need to distribute work on a cluster with no central control or scheduler, how do I pick nodes to help me? This question is not trivial. In our scenario, nodes have to ask other nodes if they are idle. We could easily spend much of our time looking for help instead of performing useful work. In my experience, the best way to archive this goal may surprise you, but first, let's look at some other ways we might handle this situation.

If each node has a list of other nodes in the cluster, then it can simply go down the list and ask for help. Assuming there will be several layers of branches or sub-tasks, we might want to consider sending the list of nodes we know are already used to each sub task, so that the sub tasks can skip checking these nodes (to see if they can be used as workers). A handy optimization. We could go further. What if we added a heartbeat between the master and worker nodes and in this heartbeat we communicated the nodes that we were using to our master and then to the other masters. The programmer mind begins to concoct all kinds of interesting ways to track and distribute state information within our program.

Hold on there cowboy. We are heading in the wrong direction. We don't want to centralize anything. As we start developing clever control algorithms, we also start adding runtime overhead to the program. That is, we want to spend a minimum amount of cycles figuring out how to distribute programs and a maximum amount of cycles doing useful work. Now about the heartbeat idea. Heartbeat is a good idea, but recall that a heartbeat only tells me that my worker node is alive, it tells me nothing about the progress of the program (halting problem, notwithstanding). We will come back to this in a moment.

In the past, I spent considerable time trying to work out an optimal way to answer the find an idle node problem. Each time I thought I had found "the way" my approach got more and more complicated. Finally, at one point, I thought, all the fancy ideas seem to no better than just picking any node. Then it dawned on me, the simplest way to find a free node is to pick a node at random. Pages of code became victims of the delete key due to this simple idea. In my later years, I have learned that randomness is a powerful tool when facing problems that have no easy solution. A dart thrown at the stock listings always seems to better than the best stock broker. (Hint: Place a dart board behind the stock listings, don't have the stock broker hold then up.)

About that heartbeat idea. It may be better to call it a handshake instead of a heartbeat because both the master and worker can benefit. Master nodes can send their workers a periodic handshake. If the node does not reply, the master knows for sure that the node is dead and the nodes work must be redone (on the master or on another worker). On the worker node, if a worker stop getting a reply from the master, then it can assume the master is dead or has finished without the worker, and stop working. The node then become free to do other work.

We can now come back to issue of redundant work. A very valid point. A possible refinement is to assign a limit to the number of worker nodes a master may use. lets call this a Branch Number. If this number is zero for all branches (loops), then the program runs on one node. If it is equal to the number of nodes in the cluster, then the program could flood the cluster with redundant work. There is probably an optimum number for each branch point in the program. How is this number determined? There could be several ways, a compiler could guess at this, but it is hard to know how the program will behave for a given cluster at runtime. Another way is to let the program optimize these for a given cluster. It may take several runs, but the program could optimize itself for a specific cluster. All that would be needed is set of list of Branch Numbers for that particular cluster. Dynamic behavior would also help load balance the cluster. If you think of two or more dynamic programs running on a cluster, then they could easily be competing for resources. The end result is a single binary that adapts at run-time to various cluster hardware, cluster loads, and to node failure. Just what I wanted.

As to issues of interconnect type or processor speed. A self tuning dynamic program will run as best it can within a given hardware environment. This means that the number of nodes a program uses will depend on the cluster internals. The idea is that instead of writing codes to use a fixed number of nodes, the program will find the best number that will work on the cluster at hand.

You Have Finally Lost Your Mind

So there you have it, the path is clear, dynamic execution can give us what we really need in the cluster world. (well, at least what I want). Of course, many of you now believe I have have gone off-the-deep-end. Furthermore, I am sure you may have questions or polite insightful comments about this approach to computing. (All of which I would love the hear about, you may contact me through My goal here was not to design new algorithm per se , but rather to throw another log on the cluster discussion fire.

Actually, I must confess, some of the ideas are not that new. A similar type of problem occurs every day in the networking world. As you know, a network is a shared device. How do arbitrary computers connected to an Ethernet network handle the traffic on the network. Instead of some complicated algorithm or central control device that helps manage traffic (that would by itself generate more traffic), Ethernet uses the power of randomness. If an Ethernet device finds the network busy, it waits for a random amount of time, and then retries. Simple, robust and low overhead.

In the future, I believe cluster computing can work the same way, but there is one more thing that has to happen before we can become the ultimate master of our nodes. Next month, we'll talk about how to express your problem so that it is more amenable to dynamic execution. Warning: Some of you are going to get upset.

Note: This article is the second of three part series. The first and third parts are:

This article was originally published in Linux Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.

Douglas Eadline can be found swinging randomly through the trees at here at ClusterMonkey