from the random thoughts department
[Note: The following updated article was originally published in Linux Magazine in June 2006 and offers some further thoughts on the concept of dynamic execution.]
In a previous article, I talked about programming large numbers of cluster nodes. By large, I mean somewhere around 10,000. To recap quickly, I pointed out that dependence on large numbers of things increase the chance that one of them will fail. I then proposed that it would be cheaper to develop software that can live with failure than try to engineer hardware redundancy. Finally, I concluded that adapting to failure requires dynamic software. As opposed to statically scheduled programs, dynamic software adapts at run-time. The ultimate goal is to make cluster programming easier: focus more on the problem and less on the minutiae of message passing. (Not that there is anything wrong with message passing or MPI. At some level messages (memory) needs to be transferred between cluster nodes.)
At the end of previous article, I promised to mention some real alternatives to the Message Passing Interface (MPI) and even suggest some wild ideas. I plan to keep my promise, but I want to take a slight detour and develop the solution a bit further. One item of business: To keep things simple, I will refer to cluster nodes as if they were a single processing unit and ignore the multi-core nature of modern servers.
A Dynamic Detour
If you think about dynamic code for a while, all kinds of issues arise. For instance, if I were to create a master application to monitor and keep track of the status of all the nodes (say, busy or free), such an application would, in one sense, be static and represent a single point of control. But wait a minute! If I want to create applications that adapt to failure, how can I create a master application that must run and interact across the whole cluster? Indeed, on a 10,000 node cluster, how do you communicate real-time node state in a timely fashion and not create congestion on the interconnect? Remember, if an application is dynamic, its location (resource use) within the cluster can change as the application runs. My guess is the overhead required to track and request this kind of information and make it available to a single master process (i.e. a batch scheduler) is a limiting factor. Astute readers may recall that one of the criteria for a "grid" is that there be no single point of control. This enviable goal should extend to clusters as well.
How Loops Become Trees
Let's develop our problem a bit more. Assume the cluster environment is a collection of independent computing nodes with a switched interconnect. The interconnect allows any computing node to talk to any other node. For the moment, don't worry about throughput and latency, because in a dynamic environment, those factors don't affect the discussion.
In addition to running programs, each node also has a predefined criteria to communicate its state when asked. To keep things simple, presuppose just two states; A busy state that means the node cannot help with the current application and a free state that indicates it's available to work on the current application.
As stated above, there is also no central monitoring/scheduling hardware or software. Let's further assume that a program can start on any node and that the program can ask for assistance from any other node. Responses can either be "busy" or "free." Starting from the first node, a dynamic program may employ hundreds of subnodes, each of which could, in theory, employ other nodes ("sub-subnodes") and so on. Hence, a program at run time resembles an inverted tree. It may help to visualize each branch point as a loop statement. The outer loop is spawned first as a series of subprocesses, and each inner loop is spawned as a sub-subprocess.
Slow or Dead? Who Cares
Now here's where it gets interesting. Ask the question, "What is the difference between a branch of the program (a sub-subprocess) that takes a really long time to complete and a node that fails?" I'd venture to say that from the perspective of the node that spawned the branch, there's no difference. If after a master node dispatches all possible branches and the node finds that it's "waiting" on a particular branch, then it may choose to seek more resources or use itself (now that it is only waiting). In this case it would seem logical to redo the branch. "Oh, the horror of duplicated work! " you say. Well, remember, the work only gets duplicated if there are free resources. If, for instance, the application has saturated the cluster or (portion of the cluster), then there are no free resources to duplicate the work.
This type of approach also works in heterogeneous environments where "slow" can be defined as everything from a dead node, to a slow node, to a flaky interconnect. A master node, living in isolation, doesn't know or care about other nodes. All it knows is that a branch (or branches) it sent to another node isn't completing and that there are other resources that may be able to help. Now, assuming that at some point, the cluster is saturated, then the "master nodes" have to become workers and do the work that was given to them by their masters. Since this work was originally expressed as branched trees, it needs to now be executed as as efficient loops on a single processor using a technique called optimized tail recursion. There are many variations that can be worked into this type of "ignorance is bliss" approach, but the general idea is for each node to get its work done as fast as possible. Wait a minute, you say. Won't there be many cases that just load up the cluster with redundant work? A good question, which means we need to address another more thorny issue — choosing what nodes to use.
Roll the Dice
At this point, my proposed methodology is staring to sound a bit off-the-wall. If you're a programmer, I suspect your tendency will be to start adding some kind of control onto this approach. Stay with me here, I invite you to let go of that programmer voice that tells you need to control everything. Take a breath. Good. Let's move on.
We are now facing the question: I need to distribute work on a cluster with no central control or scheduler, how do I pick nodes to help me? This question is not trivial. In the proposed scenario, nodes have to ask other nodes if they are idle. We could easily spend much of our time looking for help instead of performing useful work. In my experience, the best way to achieve this goal may surprise you, but first, let's look at some other ways we might handle this situation.
If each node has a list of other nodes in the cluster, then it can simply go down the list and ask for help. Assuming that there are several layers of branches or sub-tasks, you might want to consider sending the list of known busy nodes to each sub task, so that the sub tasks can skip checking these nodes (to see if they can be used as workers). A handy optimization. But, you can go further. What if you added a heartbeat between the master and worker nodes and used this heartbeat to communicate the nodes that the program is using to its master and then to the other masters. The programmer mind begins to concoct all kinds of interesting ways to track and distribute state information within our program.
But hold on there, cowboys and cowgirls. That's heading in the wrong direction. You don't want to centralize anything. As you start developing clever control algorithms, you also start adding run-time overhead to the program. That is, you want to spend a minimum amount of cycles figuring out how to distribute programs and a maximum amount of cycles doing useful work. Now, about the heartbeat idea. Heartbeat is a good idea, but recall that a heartbeat only tells me that my worker node is alive, it tells me nothing about the progress of the program (halting problem, notwithstanding). Let's come back to this in a moment.
In the past, I spent considerable time trying to work out an optimal way to answer the find an idle node problem. Each time I thought I had found "the way," my approach got more and more complicated. Finally, at one point, I thought, all the fancy ideas seem to no better than just picking any node. Then it dawned on me that the simplest way to find a free node is to pick a node at random. Pages of code became victims of the delete key due to this simple idea. In my later years, I have learned that randomness is a powerful tool when facing problems that have no easy solution. A dart thrown at the stock listings always seems to better than the best stock broker. (Hint: Place a dart board behind the stock listings, don't have the stock broker hold them up.)
About that heartbeat idea: It may be better to call it a handshake instead of a heartbeat, because both the master and worker can benefit. Master nodes can send their workers a periodic handshake. If a node doesn't reply, the master knows for sure that the node is dead and that the node's work must be redone (on the master or on another worker). On the worker node, if a worker stops getting a reply from the master, then it can assume the master is dead or has finished without the worker, and can cease working. The node then become free to do other work.
We can now come back to issue of redundant work, which is a very valid point. A possible refinement is to assign a limit to the number of worker nodes a master may use. Let's call this number a branch number. If this number is zero for all branches (loops), then the program runs on one node. If it is equal to the number of nodes in the cluster, then the program could flood the cluster with redundant work.
There's probably an optimum number for each branch point in the program. How is this number determined? There could be several ways. A compiler could guess at this, but it is hard to know how the program will behave for a given cluster at run-time. Another way is to let the program optimize these for a given cluster. It may take several runs, but the program could optimize itself (learn) for a specific cluster. All that would be needed is set of approximate branch numbers (or some other metrics that could be related to performance) for that particular cluster. Dynamic behavior would also help load balance the cluster. If you think of two or more dynamic programs running on a cluster, then they could easily be competing for resources. The end result of this random design is a single executable that adapts at run-time to various cluster hardware, cluster loads, and to node failure. Just what I wanted.
Concerning the issues of interconnect type or processor speed, a self-tuning dynamic program will run as best it can within a given hardware environment. This behavior means that the number of nodes a program uses depends on the cluster internals and load. The idea is that instead of writing codes to use a fixed number of nodes, the program finds the best number that will work on the cluster at hand.
You Have Finally Lost Your Vulcan Mind
So there you have it: The path is clear, dynamic execution can give us what we really need in the cluster world. (Well, at least what I want). Of course, many of you now believe I have have gone off-the-deep-end. Furthermore, I'm sure you may have questions or polite insightful comments about this approach to parallel computing. My goal here was not to design a new algorithm per se, but rather to throw another log on the cluster discussion fire.
Actually, I must confess, some of the ideas are not that new. A similar type of problem occurs every day in the networking world. As you know, a network is a shared device. How do arbitrary computers connected to a shared Ethernet network handle the traffic on the network? Instead of some complicated algorithm or central control device that helps manage traffic (that would by itself generate more traffic), Ethernet uses the power of randomness. If an Ethernet device finds the network busy, it waits for a random amount of time and then retries. Simple, robust, and inexpensive.
In the future, I believe cluster computing can work the same way, but there is one more thing that has to happen before we can become the ultimate master of our nodes. In a future article, I'll talk about how to express your problem so that it is more amenable to dynamic execution. Warning: Some of you are going to get upset.