
HPC without coding in MPI is possible, but only if your problem fits into one of several high-level frameworks.

[Note: The following updated article was originally published in Linux Magazine in June 2009. The background presented in this article has recently become relevant again due to the resurgence of things like genetic algorithms and the rapid growth of MapReduce (Hadoop). It does not cover deep learning.]

Not all HPC applications are created in the same way. There are applications like Gromacs, Amber, OpenFOAM, etc. that allow domain specialists to input their problem into an HPC framework. Although some work is required to "get the problem into the application," these are really application-specific solutions that do not require the end user to write a program. At the other end of the spectrum are user-written applications. The starting points for these problems include a compiler (C/C++ or Fortran), an MPI library, and other programming tools. The work involved can range from small to large because the user must concern themselves with the "parallel aspects of the problem." Note that all application software started out at this point at some time in the past.

There are also programming languages that attempt to remove the "parallel execution aspects" from the programmer's responsibility. These languages are used in a variety of contexts. For example, Linda, ZPL, CUDA, OpenCL, Ct, and others attempt to remove much of the responsibility for parallel execution from the end user. While these languages are valuable, they may represent a challenge for some users and may not be as portable as the more explicit "C/Fortran and MPI" solution.

Just above these implicitly parallel languages there is a middle ground that I like to call "Sledgehammer HPC." These methods are not so much finished applications as frameworks for representing your program at an abstraction level that can be easily executed in parallel. This type of approach allows you to focus on the problem at hand and not think about the number of cores or nodes. Some of these methods tend to be a bit domain or problem specific; they may not fit all problems, or it may be difficult to cast a problem in the framework, but when they do work, they let you throw the full weight of a cluster at your problem with minimal programming effort.

There are many possible frameworks in this area. In this article, we will cover three that I find most interesting -- Genetic Algorithms, Cellular Automata, and MapReduce. One could consider graphics rendering part of this category, but ever since the Titanic movie, the Linux cluster CGI story has been told many times over.

Genetic Algorithms

Genetic Algorithms (GAs) have been around for a while and have been used in bioinformatics, phylogenetics, computational science, engineering, economics, chemistry, manufacturing, mathematics, physics, and other fields. The idea is based on evolutionary biology, where a population of possible solutions is tested against a fitness function. The best results are saved and the worst are removed from the population. The population is restored by creating new solutions from old ones using methods that mimic genetic biology -- cross-over and mutation. The fitness test is re-run on the new population and the process repeats. The algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the current population. If the algorithm terminates due to the maximum number of generations, a satisfactory solution may or may not have been reached.

A few things are worth noting. GAs are good at optimizing; however, as with any optimization problem, landing on a local maximum is always a concern. The robustness and effectiveness are determined by the genetic methods used to re-seed the population, the population size, and the run time. In addition, solutions are typically represented as binary strings that can be easily manipulated. For instance, a random mutation may be introduced in one individual (candidate solution) by flipping a bit or bits. Likewise, cross-over may take some portion of one individual and swap it with that from another. While such low-level "twiddling" as a way to solve a high-level problem may seem inefficient or even implausible, GAs do work. They often require large amounts of computer time, particularly if the fitness function takes a while to run. Of course, in every iteration there will be many combinations that don't work or work poorly as solutions. These apparently "wasted" compute cycles are what give GAs their power, because there is no prior "knowledge" of what the best combination might be. A slight modification of a poorly performing individual may push it to the top (away from a local maximum). The need for many compute cycles is why GAs are great candidates for parallel HPC.
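
For example, if an individual is stored as a 16-bit integer, mutation and cross-over reduce to a few bitwise operations. The following Python fragment is a minimal sketch; the 16-bit width and the sample individuals are arbitrary choices for illustration.

import random

BITS = 16  # width of each individual (arbitrary, for illustration)

def mutate(individual):
    # Flip one randomly chosen bit with an XOR mask
    return individual ^ (1 << random.randrange(BITS))

def crossover(a, b):
    # Swap the low-order bits of two individuals below a random cut point
    cut = random.randrange(1, BITS)
    mask = (1 << cut) - 1
    return (a & ~mask) | (b & mask), (b & ~mask) | (a & mask)

child1, child2 = crossover(0b1111000011110000, 0b0000111100001111)
print(format(mutate(child1), "016b"), format(child2, "016b"))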

To get an idea of how a GA works, consider the following pseudo-code representation of a GA (taken from Wikipedia); a minimal Python sketch of the same loop appears after the outline.

  1. Choose initial population
  2. Evaluate the fitness of each individual in the population
  3. Repeat until termination: (time limit or sufficient fitness achieved)
    1. Select best-ranking individuals to reproduce
    2. Breed new generation through crossover and/or mutation (genetic operations) and give birth to offspring
    3. Evaluate the individual fitness of the offspring
    4. Replace worst ranked part of population with offspring
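
To make the outline concrete, here is a minimal, serial Python sketch of the loop. The fitness function is a stand-in toy (count the 1 bits in a bit string), and the population size, genome length, mutation rate, and generation limit are arbitrary placeholders you would replace for a real problem.

import random

GENOME_LEN = 32       # bits per individual
POP_SIZE = 50         # individuals in the population
GENERATIONS = 100     # termination: maximum number of generations
MUTATION_RATE = 0.01  # per-bit chance of flipping

def fitness(individual):
    # Toy fitness: count the 1 bits; replace with your real evaluation
    return sum(individual)

def crossover(mom, dad):
    # Single-point cross-over: take the head of one parent, tail of the other
    cut = random.randint(1, GENOME_LEN - 1)
    return mom[:cut] + dad[cut:]

def mutate(individual):
    # Flip each bit with a small probability
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit
            for bit in individual]

# 1. Choose initial population (random bit strings)
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

for generation in range(GENERATIONS):
    # 2. Evaluate the fitness of each individual in the population
    ranked = sorted(population, key=fitness, reverse=True)
    # 3.1 Select the best-ranking individuals to reproduce
    parents = ranked[:POP_SIZE // 2]
    # 3.2 Breed a new generation through cross-over and mutation
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(POP_SIZE - len(parents))]
    # 3.4 Replace the worst-ranked part of the population with offspring
    # (step 3.3, evaluating the offspring, happens at the top of the next pass)
    population = parents + offspring

best = max(population, key=fitness)
print("best fitness:", fitness(best))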

There are several areas for parallelization in a GA. First, each individual is independent until it is time to cull and re-seed the population. This independence offers one opportunity for parallel execution, but there may be a better way if the fitness function requires a lot of compute cycles. If the population is large, and step 3 above requires a non-trivial amount of CPU time, then it is quite easy to parallelize the fitness function tests. In many cases, the fitness function represents a reduction of data to a single number that can be used for evaluating solutions. Therefore, very little information needs to be transmitted once the fitness function is computed.
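
As a sketch of this idea, Python's standard multiprocessing module can spread the fitness evaluations across the cores of a node (an MPI library such as mpi4py would play the same role across nodes). The fitness function below is the same toy placeholder used above, standing in for something far more expensive.

from multiprocessing import Pool
import random

def fitness(individual):
    # Placeholder for an expensive evaluation (simulation, model run, etc.)
    return sum(individual)

if __name__ == "__main__":
    population = [[random.randint(0, 1) for _ in range(32)] for _ in range(500)]
    with Pool() as pool:
        # Evaluate every individual in parallel; only one number per
        # individual comes back, so very little data is communicated.
        scores = pool.map(fitness, population)
    best_score = max(scores)
    print("best fitness this generation:", best_score)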

As you can infer, the user is required to create the fitness function and design the data representation to be used as the individuals in the population. While this may take some thought, once the problem is cast in "GA terms" you can unleash the full power of a cluster on your GA. Need more power? No need to touch the code; just add some nodes.

Before we leave GA land, let's consider a simple example. Most compilers have more options than any human can manage. To test various options and option combinations you need to try each combination and see which one produces the best run time. Tedious, to say the least. One could "script up" such an experiment, but why not try a GA? The fitness function would be compiling the program, running it, and recording the time. The options (individuals) producing the highest times are removed, new individuals are created, and the fitness tests run again, repeating the cycle. While I am glossing over details, you get the general idea. If you want to get really fancy, throw in MPI options (Open MPI is a good candidate!). In this case you may need several nodes to calculate your fitness function.
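
A rough sketch of such a fitness function is shown below: compile with a candidate set of flags, run the resulting binary, and return the elapsed time. The source file name, executable name, and example flags are made up for illustration; a real version would also pin the run to a quiet node and check the program's output.

import subprocess
import time

def fitness(flags, source="bench.c", runs=3):
    # Compile 'source' with the candidate flags; a failed build gets an
    # infinite (worst possible) time so the combination is culled.
    exe = "./bench_candidate"
    build = subprocess.run(["gcc", *flags, "-o", exe, source])
    if build.returncode != 0:
        return float("inf")
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        run = subprocess.run([exe])
        elapsed = time.perf_counter() - start
        if run.returncode == 0:
            best = min(best, elapsed)   # keep the best observed run time
    return best

print(fitness(["-O2", "-funroll-loops"]))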

If you want to play with parallel GAs, you can check out PGAPack (Parallel Genetic Algorithm Package), a general-purpose, data-structure-neutral, parallel genetic algorithm library developed at Argonne National Laboratory. There are others of course (Cluster Documentation Project). I'm not sure which is the best one. Maybe I'll write a GA to find out. More information is available at genetic-programming.org.

Cellular Automata

Most people know Cellular Automata (CA) from the game of Life, developed by mathematician John Conway at Cambridge in the late 1960s and popularized by Martin Gardner in Scientific American. The concept is quite simple. Each cell in a grid has a state. The state of a cell is determined by a set of rules whose inputs are the states of neighboring cells. Each generation represents the application of the rules to all cells. The game of Life uses a simple set of rules and helps illustrate some basic ideas behind CAs. With the right set of rules, complex behavior can be observed in a CA. This complexity from simplicity is what makes CAs so interesting.

To illustrate a CA, the rules for Life are given below. Consider a square 2D grid (although CAs have also been run on triangular and hexagonal grids, and in one and three dimensions). In the game of Life, a cell is either alive (on state) or dead (off state). An initial pattern is usually placed on the grid and then the automaton is started. Each generation represents one pass over the entire grid.

  1. A live cell with fewer than two live neighbors dies (under-population)
  2. A live cell with two or three live neighbors survives to the next generation
  3. A live cell with more than three live neighbors dies (over-population)
  4. A dead cell with exactly three live neighbors becomes a live cell (reproduction)

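A minimal sketch of one Life generation is shown below, using NumPy with periodic (wrap-around) boundaries; the grid size and the glider pattern are arbitrary choices.

import numpy as np

def life_step(grid):
    # Count the eight neighbors of every cell by shifting the grid
    # (np.roll gives periodic, wrap-around boundaries)
    neighbors = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    # Survival: a live cell with 2 or 3 neighbors stays alive
    # Birth: a dead cell with exactly 3 neighbors comes alive
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

grid = np.zeros((20, 20), dtype=int)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1  # a glider
for generation in range(10):
    grid = life_step(grid)
print(grid.sum(), "live cells after 10 generations")
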
If you want to play with Life on-line you can go to sites like this one and enter patterns to see what happens. One thing to note is that small variations in patterns can cause huge changes in outcomes.

A popular and controversial book by Stephen Wolfram called A New Kind of Science was mostly about the properties and possibilities of CAs. In general, CAs work well when you want to model complex things with lots of little parts. Of course there are other ways to model complex things, but CAs have the advantage of starting with "simple rules" and generating complex results. There are CAs for fluid flow, ecosystems, populations, environmental systems, urban growth, and even traffic, among other areas. CAs are often used to study "emergent systems" -- those systems that start out simple but then self-organize.

As you can imagine, a CA can be executed in parallel. There are really no restrictions on how far a neighbor interaction can reach, but in general, most interactions are very local, i.e., they take place between close neighbors. This type of interaction keeps the dataflow local and thus makes parallel execution possible. If the grid is broken into parts, only the boundaries need to communicate. Like Genetic Algorithms, CAs are naturally parallel and can take advantage of the cheap and plentiful parallel computing cycles that clusters offer. There are even parallel CA programming languages like CAOS (which stands for Cells, Agents and Observers for Simulation).
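
As a hedged illustration of the decomposition, the fragment below splits a grid into two horizontal strips and pads each strip with "ghost" rows copied from its neighbor's boundary; only those boundary rows need to be exchanged each generation. Here both strips live in one process for simplicity; with MPI, each strip would live on its own node and the copies would become messages.

import numpy as np

full = np.random.randint(0, 2, (100, 100))   # the whole CA grid
top, bottom = full[:50, :], full[50:, :]     # two sub-domains (one per "node")

def add_ghost_rows(strip, above, below):
    # Pad a strip with one ghost row from each neighboring strip so the
    # local update rule can see across the sub-domain boundary
    return np.vstack([above[-1:, :], strip, below[:1, :]])

# With two strips and wrap-around boundaries, each strip's neighbor above
# and below is the other strip; only these rows would be communicated.
top_ext = add_ghost_rows(top, bottom, bottom)
bottom_ext = add_ghost_rows(bottom, top, top)
print(top_ext.shape, bottom_ext.shape)   # (52, 100) each; apply the rules to
                                         # the interior, then exchange again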

There is a certain appeal to using CAs as a tool to simulate large systems. Indeed, CAs can even produce pseudo-randomness as part of their behavior. Cellular automata provide an alternative method for studying the behavior of dynamical systems. Traditionally such systems are studied with differential equations. While differential equations are the workhorse of such systems, difficulties arise from non-linear behavior, round-off errors, and variations in initial conditions. Because CAs are built from "simple rules," they are easier to analyze than differential equations and can tolerate modifications without affecting the stability of the analysis.

Keep the CA approach in mind when you have some extra cycles and want to play with a new idea. For instance, what if you tried to simulate a cluster queuing algorithm? You never know what may happen. The often-noted problem with CAs, similar to Genetic Algorithms, is that you can find something that works but not be sure why. If you want to find more information about CAs, have a look at a 2003 Survey of Cellular Automata or take a gander at the old but still useful Cellular Automata FAQ.

MapReduce

Our last sledgehammer approach to HPC is currently quite popular in big data analytics. If you have searched for anything on the web, you have probably used (in the background) a MapReduce algorithm. The idea is basically the same thing *nix users have been doing for years, but MapReduce in applications like Apache Hadoop can do it on a much larger scale. For those interested in a good starting point, consider the Hadoop 2 Quick-Start Guide. Hadoop version 1 is a free implementation of MapReduce at scale, although MapReduce itself can be implemented in any language. Hadoop version 2 supports MapReduce and many other scalable methods. In particular, Apache Spark is a popular platform for MapReduce and many other analytics operations. In terms of MapReduce, consider the following "piped" set of commands:

cat * | grep | sort | uniq -c | cat > file

Nothing fancy there. Now try this on 20 TBytes of data. MapReduce works by distributing the work across multiple servers. Think of it as a partitioned file system where files are broken into large blocks. Each block is often replicated on several data nodes (for reliability). With this arrangement, searching can be done in parallel. Figure One shows the general MapReduce algorithm.

Figure One: MapReduce is a Single Instruction, Multiple Data (SIMD) approach to parallel computing.

In terms of our command string above, the MapReduce algorithm can be thought of as follows:

cat *   |   grep   |   sort    |   uniq -c  |   cat > file
input  ->   Map    ->  shuffle  ->  Reduce  ->   output

A MapReduce operation works with key/value pairs. It is primarily used for log processing, web search indexing and ranking, and ad hoc data queries. Because the search is distributed and independent, MapReduce algorithms can have near-linear scalability. As mentioned, the idea is simple: by breaking the data up across servers (hard drives), you can spin many disks at the same time. Another MapReduce trick is to move the processing to the data and not the other way around, which drastically improves performance. Another key aspect of MapReduce is the assumption that the data are static, i.e., they are not going to change while the search is in progress. Because the data are often redundantly stored, there is also a level of fault tolerance in a MapReduce cluster.
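
As a hedged sketch of the key/value flow (plain Python, not Hadoop's actual API), the classic word-count example looks like this: map emits (word, 1) pairs from each record, shuffle groups the pairs by key, and reduce sums each group. On a real cluster, each stage runs on many nodes over blocks of the distributed file system.

from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: turn each input record into (key, value) pairs independently,
# so this stage can run on as many nodes as there are data blocks
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values that share a key (the 'sort' in the pipe)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: collapse each group to a single result (the 'uniq -c' in the pipe)
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'the': 3, 'quick': 2, 'brown': 1, ...}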

The most popular MapReduce package is the Apache Software Foundation's Hadoop project. (It is named for project creator Doug Cutting's child's stuffed elephant.) Hadoop has three major components: the Hadoop Distributed File System (HDFS), a resource scheduler (YARN), and the MapReduce engine framework.

The HDFS is built from a cluster of DataNodes and is designed for streaming file access (i.e., not for user home directories). DataNodes can talk to each other to re-balance data, to move copies around, and to keep the replication of data high. There is also a central NameNode that serves as both the directory namespace manager and the "inode table" for the HDFS. HDFS offers high availability, federation, snapshots, NFSv3 mounts, and more.

The MapReduce engine sits above YARN and the HDFS. Clients submit MapReduce jobs to YARN, which attempts to keep the work as close to the data as possible. HDFS is rather unique in that it is "rack aware." MapReduce gets its performance from starting multiple "maps" on the DataNodes. If an individual map or reduce fails or times out, that part of the job is rescheduled. Hadoop is an active project and represents a powerful way to search through reams of data quickly.

If I Had A Hammer

While the three methods mentioned above are not typical uses for HPC clusters, they are nonetheless capable of throwing large amounts of compute power at an appropriate problem. In each case, the user is also removed from much of the parallel computing minutiae all too familiar to many HPC users. Such an advantage only really works if you swing your hammer and hit a round peg into a round hole. If you have a square peg and a round hole, a sledgehammer may make it fit, but it may not give the best result. Swing carefully.