Dynamic Parallel Execution: Losing control at the high end
In a past column, I talked about programming large numbers of cluster nodes. By large, I mean somewhere around 10,000. If you have been following along, at the end of the article I had promised to mention some real alternatives to MPI and even suggest some wild ideas. I plan to keep my promise, however, I wanted to take a slight detour this month and develop the solution a bit further. One point of note before we begin. To keep things simple, I will refer to cluster nodes as if they were a single processing unit.
As you may recall, my conclusion was based on the notion that dependence on large numbers of things increase the chance that one of them will fail. I then proposed that it would be cheaper to develop software that can live with failure than try to engineer hardware redundancy. Finally, I concluded that adapting to failure will require dynamic software. As opposed to statically scheduled programs, dynamic software will adapt at run-time. I should also mention that my motivation is to make programming clusters easier by moving closer to my problem and further away from the minutia of message passing.
A Dynamic Detour
If you think about dynamic code for while all kinds of issues arise. For instance, if I were to create a master application to monitor and keep track of the status of all the nodes (e.g. busy or free) such an application would in one sense be static and represent a single point of control. But wait a minute, if we are creating applications that adapt to failure, then how can we create a master application that must run/interact across the whole cluster. Indeed, on a 10,000 node cluster, how will you communicate real-time node state in a timely fashion and not create congestion on the interconnect. Remember, if applications are running in a dynamic fashion, then their location (resource use) within the cluster can change as the application runs. My guess is the overhead required to track/request this kind of information and make it available to a single master process (i.e. a batch scheduler) will become a limiting factor. Astute readers may recall, that one of the criteria for a "Grid" is that there be no single point of control. Such will be our goal with clusters as well.How Loops Become Trees
Let's develop our problem a bit more. We will assume the cluster environment is collection of independent computing nodes with a switched interconnect. The interconnect allows any computing node to talk to any other node. We are not going to worry about throughput and latency because in a dynamic environment, they will not effect our discussion. In addition to running programs, the nodes will also have a predefined criteria to communicate their state when asked. To keep things simple, we will allow two states; A busy state that means the nodes cannot help with the current application, and a free state that they will accept work for the current application.As stated above, we also have no central monitoring hardware. Let's further assume that we can start a program on any node, and the program can ask for assistance from any other node. The response it can expect is either busy of free. Starting from our first node, our dynamic program may employ hundreds of sub-nodes, each of which could, in theory, employ other nodes (sub-sub-nodes) and so on. Our program at run time is beginning to look like an inverted tree. It may help to visualize each branch point as and loop statement. The outer loop will get spawned first as a series of sub-processes, then the inner loop would get spawned as a sub-sub-process.
Slow or Dead Who Cares
Here is where it gets interesting. Ask the question, What is the difference between a branch of the program (e.g. a sub-sub-process) that takes a really long time to complete and a node that fails? I will venture to say that from the perspective of the node that spawned the branch, there is no difference. If after a master node dispatches all possible branches and the node finds that it is "waiting" on a particular branch, then it may choose to seek more resources or use itself (now that it is only waiting). In this case it would seem logical to redo the branch. Oh, the horror of duplicated work, you say. Well, remember, the work only gets duplicated if there are free resources. If for instance, the application has saturated the cluster or (portion of the cluster), then there are no free resources to duplicate the work.This type of approach will also work in heterogeneous environments where "slow" can be defined as everything from a dead node, to a slow node, to a flaky interconnect. A master node, living in isolation, does not know or care about other nodes. All it knows is that a branch (or branches) it sent to another node is not completing and that there are other resources that may be able to help. Now assuming that at some point, the cluster is saturated, then the "master nodes" will have to become workers and do the work that was given to them by their masters. There are many variations that can be worked into this type of "ignorance is bliss" approach, but the general idea is for each node to get their work done as fast as possible.
Wait a minute, you say, there are sure to be many cases that would just load up the cluster with redundant work. A good question, which means we need to address another more thorny issue -- choosing what nodes to use.
Roll the Dice
At this point, my proposed methodology is staring to sound a bit off-the-wall. As a programmer, I suspect your tendency will be to start adding some kind of control onto this approach. Stay with me here, I invite you to let go of that programmer voice that tells you need to control everything. Take a breath. Good, let's move on.We are now facing the question, I need to distribute work on a cluster with no central control or scheduler, how do I pick nodes to help me? This question is not trivial. In our scenario, nodes have to ask other nodes if they are idle. We could easily spend much of our time looking for help instead of performing useful work. In my experience, the best way to archive this goal may surprise you, but first, let's look at some other ways we might handle this situation.
If each node has a list of other nodes in the cluster, then it can simply go down the list and ask for help. Assuming there will be several layers of branches or sub-tasks, we might want to consider sending the list of nodes we know are already used to each sub task, so that the sub tasks can skip checking these nodes (to see if they can be used as workers). A handy optimization. We could go further. What if we added a heartbeat between the master and worker nodes and in this heartbeat we communicated the nodes that we were using to our master and then to the other masters. The programmer mind begins to concoct all kinds of interesting ways to track and distribute state information within our program.
Hold on there cowboy. We are heading in the wrong direction. We don't want to centralize anything. As we start developing clever control algorithms, we also start adding runtime overhead to the program. That is, we want to spend a minimum amount of cycles figuring out how to distribute programs and a maximum amount of cycles doing useful work. Now about the heartbeat idea. Heartbeat is a good idea, but recall that a heartbeat only tells me that my worker node is alive, it tells me nothing about the progress of the program (halting problem, notwithstanding). We will come back to this in a moment.