But it does not stop me from asking
Fifteen years ago I wrote a short article in a now defunct parallel computing magazine (Parallelogram) entitled "How Will You Program 1000 Processors?" Back then it was a good question that had no easy answer. Today, it is still a good question that still has no easy answer. Except now it seems a bit more urgent as we step into the "mulit-core" era. Indeed, when I originally wrote the article, using 1000 processors was a far off, but real possibility. Today, 1000 processors are a reality for many practitioners of HPC. As dual cores hit the server rooms, effectively doubling the processor counts, many more people will be joining the 1000P club very soon.
So let's get adventurous and ask, "How will you program 10,000 processors?" As I realized fifteen years ago, such a question may never really have a complete answer. In the history of computers, no one has ever answered the question to my liking -- even when considering ten processors. Of course there are plenty of methods and ideas like threads, messages, barrier synchronization, etc., but when I have to think more about the computer than about my problem, something is wrong.
Having spent many a night trying to program parallel computers (the most recent incarnation being clusters) I have come up with a list of qualities that I want in a parallel programming language. Since I am wish for the moon, I may be asking for the impossible, but I believe some of the features I describe below are going to be necessary before using large number of processors will become feasible for the unwashed masses of potential HPC users.
In the future, I expect clusters to have constant (and expected) failures. Furthermore, the cost to increase the MTBF will probably be prohibitive and adapting to failure will be an easier solution.
I then have to ask, "How the heck do write a program for hardware you know is going to fail at some point?" The answer is quite simple, the program will have to tolerate hardware failures. In other words software must become fault tolerant. And, here is the important part, I the programmer should not have write this into my program.
For example, if I should be able to develop a program on my laptop and move the same binary over to a sixteen processor cluster and run it without any modification. If the cluster is running other programs at the same time and there are only four idle processors, then my program should start using these four. As other processors become available it should grow only to the point that adding more processors does not help performance. At a later point in time, if I want to run my program with a larger problem size on 1000 processors, I should be able able to run the same program.
Additionally, as part of this dynamic scheme there should be no central point of control. Programs should behave independently and not have to rely on a single resource. Indeed, within the programs themselves subparts should be as autonomous as possible. Centrally managing sixteen processors seems reasonable, managing sixteen hundred and having some time to do computation is a real challenge.
Even if we manage to produce an army of MPI programmers, there is another more subtle issue that must be addressed. As written, most parallel programs cannot provide a guarantee of efficient execution on every computer. There is no assurance that when I rebuild my MPI/Pthreads/OpenMP program on a different computer it will run optimally. A discussion of this topic is beyond the scope of this column, but let me just say, that each cluster or SMP machine has a unique ratio of computation to communication. This ratio determines efficiency and should be considered when making decisions about parallelization. For some applications like rendering, this ratio makes little difference, in others it can make a huge difference in performance and determine the way you slice and dice your code. unfortunately, your slicing and dicing may work well on one system, but there is no guarantee it will work well on all systems.
MPI has often been called the machine code for parallel computers. I would have to agree. It is portable, powerful, and unfortunately, in my opinion, too close to the wires for everyday programming. In my parallel computing utopia, MPI and other such methods are as hidden as register loads are in a bash script.
In my long forgotten article, I made the case that in the early days of computing there was a huge debate concerning the use of a new language, called FORTRAN, instead of assembly language (machine code). Yes, in those dark early days, there was no Perl or Python and the new FORmula TRANslation language was a breakthrough idea because it abstracted away some of the machine and let non-programmers like scientists easily program formulas. The argument went something like this:
Assembly Code Wonk: "If I use FORTRAN instead of assembly language, I loose quite a bit of performance, so I will stick with loading my registers thank you."
FORTRAN Wonk: "Yes, but when the new computer comes next year, I will not have to rewrite my program in a new machine code. And, besides, the new FORTRAN II compiler will optimize my code."
Assembly Code Wonk: "Only time will tell us the best solution. By the way, is that new pencil thin neck tie you are wearing with a new white short sleeve shirt?" {mosgoogle right}
Time did tell us what happened. FORTRAN (or now written as Fortran) allowed many more people to write code. It also allowed code to spread quicker as new machines came on line. Suddenly there was, and still is by the way, vast amounts of Fortran programs doing all kinds of useful things.
If we are going to open up parallel computing to the domain experts we need to introduce a new abstraction level in which to express their problem. My wish is that once the problem is described (or declared) in a new language, compilers and run-time agents can deliver the features I described above.
Note: This article is the first of three part series. The second and third parts are:
This article was originally published in Linux Magazine. It has been updated and formatted for the web. If you want to read more about HPC clusters and Linux you may wish to visit Linux Magazine.
Douglas Eadline is editor of ClusterMonkey and does wear a pencil thin necktie, although there is one in his closet somewhere.