You can't always get what you want. But that doesn't stop me from asking. Besides, Buddha's got my back. Here's my HPC wish list.
[Note: A recent article by Al Geist, How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder reminds us that statistics at scale represents a big challenge for HPC. The following updated article was originally published in Linux Magazine in May 2006 and offers some further thoughts on this important issue. ]
Twenty five years ago I wrote a short article in a now defunct parallel computing magazine (Parallelogram) entitled "How Will You Program 1000 Processors?" Back then, it was a good question that had no easy answer. Today, it's still a good question with no easy answer, except now it seems a bit more urgent as we step into the "multi-core" era. Indeed, when I originally wrote the article, using 1,000 processors was a far off, but real possibility. Today, 1,000 processors are a reality for many practitioners of high-performance computing (HPC). And as dual-cores (and now 18-core processors) hit the server room, effectively doubling processor counts, many more people will be joining the 1,000P club very soon.
So let's get adventurous and ask, "How will you program 10,000 processors?" As I realized twenty five years ago, such a question may never really have a complete answer. In the history of computers, no one has ever answered such a question to my liking — even when considering ten processors. Of course, there are plenty of methods and ideas like threads, messages, barrier synchronization, and so on, but when I have to think more about the computer than about my problem, something is wrong.
Having spent many a night trying to program parallel computers (the most recent incarnation being clusters), I have come up with a list of qualities that I want in a parallel programming language. Since I am wishing for the moon, I may be asking for the impossible, but I believe some of the features I describe below are going to be necessary before using large numbers of processors becomes feasible for the unwashed masses of potential HPC users.
Failure Is an Option
It is said that the Buddha's last words were "Decay is inherent in all complex/component things." And Buddha wasn't even a system administrator.Clusters are complex/component things. The bigger the cluster, the more things that can decay. A program that routinely uses over 1,000 processors will experience component failures at some point. As a hypothetical example, if you have 1,000 cluster nodes with a Mean Time Between Failure (MTBF) of 10,000 hours (or 1.1 years), you can expect one node to fail every ten hours. Given that the MTBF is fixed for most computer hardware, using more and more processors for your program ultimately becomes a losing proposition.
In the future, I expect clusters to have constant (and expected) failures. Furthermore, the cost to increase the MTBF will probably be prohibitive and adapting to failure will be an easier solution.
I then have to ask, "How the heck do you write a program for hardware you know is going to fail at some point?" The answer is quite simple: the program must tolerate hardware failures. In other words, software must become fault tolerant. And, here is the important part: the programmer should not have write this into the program.
Dynamic Scalability
One way to make a program fault tolerant is to make it dynamically scalable. That is, it can change the number of processors it's using on-the-fly. Adding fault tolerance means redoing work, so some mechanism is needed to dynamically assign processors.
Dynamic scalability is therefore, the next thing I want in my program. The idea is quite simple: I want one program that can run on 10,000 processors as well as one processor. Of course, large problem sizes may not be feasible on one processor. After all, if a large problem requires 10,000 processors for an hour, it would take 1 processor 10,000 hours (assuming that processor had enough memory). I should, however, be able to run a small data set on one processor and then scale the same binary up to a maximal number of processors for that given problem size (and everything in between).
For example, I should be able to develop a program on my laptop and move the same binary over to a thousand-core cluster and run it without any modification. If the cluster is running other programs at the same time and there are only four idle processors, then my program should start using those four. Then, as other processors become available, it should grow only to the point that adding more processors does not help performance. At a later point in time, if I want to run my program with a larger problem size on 1,000 processors, I should be able able to run the same program. Interestingly, the new Hadoop 2 resource manager (YARN) allows programs to negotiate for resources at run-time, i.e. applications can request new cluster cores at run-time, but not necessarily get them, and return idle cores as the application progresses.
No More Standing in Line
Now, because my program is now dynamically scalable, I assume yours is as well. In this case, our programs should be able to co-operate with one another. If we both have a program to run at the same time, we should share resources optimally. In many cases, the need to schedule or queue jobs will not be necessary because the programs manage themselves. My program will constantly negotiate with the other running programs to get the best set of cluster resources. For instance, my program might negotiate to wait while others run, if it can get exclusive access to 1000 cores for one hour. I don't care how the programs do it, I just want them to behave this way and I don't want to have to write such behavior into my program.
Additionally, as part of this dynamic scheme, there should be no central point of control. Programs should behave independently and not have to rely on a single resource. Indeed, within the programs themselves, subparts should be as autonomous as possible. Centrally managing sixteen processors seems reasonable; managing sixteen hundred and having some time to do computation is a real challenge.
And then Some
Finally, I want an expressive language that is free of any artifacts caused by the underlying hardware. I want to be as close to the application I am coding as possible. Thinking in "my problem space" is where I want to live. Concerning myself with memory management, numbers of processors, and other such details takes me away from my problem domain.
In short, that's my wish list: fault tolerant, dynamically scalable, co-operative, and expressive applications. A simple wish, but a tall order. Realizing that I seldom get what I want, I have set my expectations high, so maybe, just maybe, I'll get what I need.
So, how do we to get to this software nirvana? I thought you would never ask. I have some ideas, but first, let's address a few issues that always seem to come up when I talk about this topic.
What about MPI?
Let me be clear: The Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) are wonderful ideas. Both have allowed me and countless others to use collections of processors to achieve great things.
Rest assured message passing will not be eclipsed by a new "programming" technology any time soon. In fact, in all likelihood, message passing will be at the core of most parallel applications in the future because you cannot have parallel computing without communication.
However, as important as MPI is to the HPC world, it does represent a barrier to the domain expert. That is, programming in MPI is too much of an investment for Joe Sixpack programmer. It requires not only code changes, but testing and debugging are harder, and possible major re-writes may be necessary. For those cheering on the sidelines, threads and OpenMP are in the same boat. Sure, the results can be impressive, but complexity is the cost one pays for working at this level.
Even if we manage to produce an army of MPI programmers, there is another more subtle issue that must be addressed. As written, most parallel programs cannot provide a guarantee of efficient execution on every computer. There is no assurance that when I rebuild my MPI/Pthreads/OpenMP program on a different computer it will run optimally. A discussion of this topic is beyond the scope of this column, but let me just say that each cluster or SMP machine has a unique ratio of computation to communication. This ratio determines efficiency and should be considered when making decisions about parallelization. For some applications like rendering, this ratio makes little difference, while in others it can make a huge difference in performance and determining the way you slice and dice your code. Unfortunately, your slicing and dicing may work well on one system, but there is no guarantee it will work well on all systems.
MPI has often been called the machine code for parallel computers. I would have to agree. It is portable, powerful, and unfortunately, in my opinion, too close to the wires for everyday programming. In my parallel computing utopia, MPI and other such methods are as hidden as register loads are in a bash script.
Abstract Art
Climbing above the MPI layer will not come without a cost. Just as there is loss of possible performance when going from assembly language to C, there will be a loss of efficiency when programming without explicit messages. The term often used is a "higher abstraction level". The reasons high-level languages are so popular is because they provide a high level of abstraction above the hardware. Programmers move closer to their application and farther away from the the computer.
In my long forgotten article, I made the case that in the early days of computing there was a huge debate concerning the use of a new language called FORTRAN instead of assembly language (machine code). Yes, in those dark early days, there was no Perl or Python and the new FORmula TRANslation language was a breakthrough idea because it abstracted away some of the machine and let non-programmers like scientists easily program formulas. The argument went something like this:
Assembly Code Wonk: "If I use FORTRAN instead of assembly language, I lose quite a bit of performance, so I will stick with loading my registers, thank you."
FORTRAN Wonk: "Yes, but when the new computer comes next year, I will not have to rewrite my program in a new machine code. And, besides, the new FORTRAN II compiler will optimize my code."
Assembly Code Wonk: "Only time will tell us the best solution. By the way, is that a new pencil thin neck tie you are wearing with a new white short sleeve shirt?"
Time did tell us what happened. FORTRAN (or now written as Fortran) allowed many more people to write code. It also allowed code to spread quicker as new machines came on line. Suddenly there was, and still is by the way, vast amounts of Fortran programs doing all kinds of useful things.
If we are going to open up parallel computing to the domain experts we need to introduce a new abstraction level in which to express problems. My wish is that once the problem is described (or declared) in a new language, compilers and run-time agents can deliver the features I described above.
Cliff Hanging
Now that I have you on the edge of your seat, I need to finish this cliff-hanger in the next article. Not to worry though, I will provide some real alternatives to MPI and even suggest some wild ideas. And maybe if we start thinking and talking about these issues and ideas, we may find an answer to my question.
I am optimistic. After all Buddha also said, "What we think, we become."