Is Hadoop the New HPC? | Opinions

Hadoop has been growing clusters in datacenters at a rapid pace. Is Hadoop the new corporate HPC?

Apache Hadoop has been generating a lot of headlines lately. For those that are not aware, Hadoop is an open source project that provides a distributed file system and MapReduce framework for massive amounts of data. The primary hardware used for Hadoop is clusters of commodity servers. File sizes can easily be in the petabyte range and can easily use hundreds or thousands of compute servers.

Hadoop also has many components that live on top of the core Hadoop File System (HDFS) and the MapReduce mechanism. There are some interesting similarities between an HPC cluster and a Hadoop cluster, but how much cross over between the two disciplines depends on the application. Hadoop strengths lie in the shear size of data it can process. It also is highly redundant and can tolerate node failures without halting user jobs.

Who Uses Hadoop

There are many organizations that use Hadoop on a daily basis including Yahoo, Facebook, American Airlines, Ebay, and many others. Hadoop is designed to allow users to manipulate large unstructured or unrelated data sets. It is not intended to be a replacement for a RDMS. For example, Hadoop can be used to scan weblogs, on-line transaction data, or web content. All of which are growing each year.

Map Reduce

To many HPC users, MapReduce is a methodology used by Google to process large amounts of web data. Indeed, the now famous Google MapReduce paper was the inspiration for Hadoop.

The MapReduce idea is quite simple and when used in parallel can provide extremely powerful search and compute capabilities. There are two major steps in the MapReduce process. If you have not figured it out, there is a "Map" step followed by a "Reduce" step. Some are surprised to learn that mapping is done all the time in the *nix world. For instance consider:

grep "the" file.txt

In this simple example we are "mapping" all the occurrences, by line, for the word "the" in a text file. Though the task seems somewhat trivial, suppose the file was one terabyte. How could we speed up the "mapping" step. The answer is also simple, break the file into chunks and put a different chunk on a separate computer. The results can be combined when the job is finished as there are no dependencies within the map step. The popular MPIBLAST takes the same approach by breaking the human genome file into chunks and performing the "BLAST mapping" on separate cluster nodes.

Suppose we wanted to calculate the total number of lines with "the" in them. The simple answer is to pipe the results into wc (word count):

grep "the" file.txt | wc -l

We have just introduced a "Reduce" step. For our large file parallel mode, each computer would perform the above step (grep and wc) and send the count to the master node. That in nutshell is how MapReduce works. Of course, there are more details, like key-value pairs and "the shuffle", but for the purposes of this discussion, MapReduce can be that simple.

With Hadoop, large files are placed in HDFS which automatically breaks the file into chunks and spreads them across the cluster (usually in a redundant fashion). In this way, parallelizing the map process is trivial, all that needs to happen is to place s separate map process on each node with file chunk. The results are then sent to reduce processes, which also run on the cluster. As you can imagine, large files produce large amounts of intermediate data, thus multiple reducers help keep things moving. There are some aspects to the MapReduce process worth noting:

MapReduce can be transparently scalable. The user does not need to manage data placement or the number of nodes used for their job. There is no dependency on the underlying hardware.
Data flow is highly defined and in one direction from the map to the reduce, there is no communication between independent mapper or reducer processes.
Since processing is independent, fail over is trivial, a failed process can be restarted, provided the underlying file system is redundant like HDFS
MapReduce, while powerful, does not fit all problem types.

In order to understand the difference between Hadoop and a typical HPC cluster, let's compare several aspects of both systems.

Hardware

Many modern HPC clusters and Hadoop clusters use commodity hardware, which are primarily x86 based servers. Hadoop clusters usually include a large amount of local disk (used for HDFS nodes) while many HPC clusters rely on NFS or a parallel file system for cluster-wide storage. In the case of HPC, there are diskless and diskfull nodes, but in terms of data storage, a separate group of hardware is often used for a global file storage. HDFS daemons run on all nodes and store data chunks locally. It is does not support the POSIX standard. Hadoop is designed to move the computation to the data, thus the need for HDFS to be distributed throughout the cluster.

In terms of networking, Hadoop clusters almost exclusively use Gigabit Ethernet (GigE). As the price continues to fall, newer systems are starting to adopt 10 Gigabit Ethernet (10GigE). Although, there are many GigE and 10GigE HPC clusters, InfiniBand is often the preferred network.

Many new HPC clusters are using some form of acceleration hardware on the nodes. These additions are primarily from NVidia (Kepler) and Intel (Phi). They require additional programming (in some cases) and can provide substantial speed-up for certain applications.

Resource Scheduling

One of the biggest differences between Hadoop and HPC systems is resource management. In HPC there is a fine grained control of what resources (cores, accelerators, memory, time, etc.) are given to users. These resources are scheduled with tools like Grid Engine, Moab, LoadLeveler, etc. With Hadoop, there is an integrated scheduler that consists of a master Job Tracker, which communicates with Task Trackers on the nodes. All MapReduce work is supervised by the Job Tracker. There are no other job types supported in Hadoop (Version 1).

One interesting difference between an HPC resource scheduler and the Hadoop Task Tracker is fault tolerance. HPC schedulers can detect down nodes and reschedule jobs (as an option) but if the job has not been checkpointing, it must start form the beginning. Hadoop, due to the nature of the MapReduce algorithm, can manage failure through the Job Tracker. Because the task tracker is aware of job placement and data location a failed node (or even a rack of nodes) can be managed at run-time. Thus, when a HDFS node fails, the Job tracker can reassign a task to a node where a redundant copy of the data exists. Similarly, if a map or reduce process fails, the job can be restarted on a new node.

The next generation scheduler for Hadoop is called YARN (Yet Another Resource Negotiator) and offers better scalability and more fine grained control over the job scheduling. Users can request "containers" for MapReduce and other jobs (possibly MPI) which are managed by individual per job Application Masters. With YARN the Hadoop scheduler starts to look like other resource managers, however, it will be backward compatible with many higher level Hadoop tools.

Programming

One of the big differences between Hadoop and HPC is the programming models. Most HPC applications are written in Fortran, C, C++ with the aid of MPI libraries. There are also CUDA based applications as well and those optimized for Intel Phi. The responsibility of the users is actually quite large. Application authors must manage communications, synchronization, I/O, debugging, and possibly checkpointing/restart operations. These tasks are often not easy to get right and can take significant time to implement correctly and efficiently.

Hadoop by offering the MapReduce paradigm only requires that the user create a map step and reduce step (and possibly some others, i.e. combiner). These tasks are devoid of all the minutia of HPC programming. The user only need concern themselves with these two tasks which can be easily debugged and tested using small files on single system. Hadoop also presents a single name space parallel file system (HDFS) to user. Hadoop was written in Java and has a low level interface to write and run MapReduce applications, but it also supports an interface (called Streams) that allows mappers and reducers to be written in any language. Above these language interfaces sit many high level tools such as Apache Pig, a scripting language for Hadoop MapReduce, and Apache Hive, a SQL like interface to Hadoop MapReduce. Many users operate using these and other higher level tools tools and may never actually write actual mappers and reducers. This situation is analogous to application users in HPC that never write actual MPI code.

Parallel Computing Model

MapReduce can be classified as a SIMD (Single Instruction Multiple Data) problem. Indeed, the map step is highly scalable because the same instructions are carried out over all data. Parallelism arises from breaking the data into independent parts. There can be no forward or backward dependencies (side effects) within a map step, that is, the map step may not change any data (even it's own). The reducer step is similar in that it applies the same reduction process to a different set of data (the results of the map step).

In general, the MapReduce model provides a functional programing model rather than procedural one. Similar to a functional language, MapReduce cannot change the input data as part of the mapper or reducer process, which is usually a large file. Such restrictions can at first be seen as inefficient, however, the lack of side effects allows for easy scalability and redundancy.

An HPC cluster on the other hand, can run SIMD and MIMD (Multiple Instruction Multiple Data) jobs. The programmer determines how to execute the parallel algorithm. As noted above, this added flexibility comes with addition responsibilities. There is no restriction, however for a user to create their own MapReduce application withing the framework of a typical HPC cluster.

Big Data Needs Big Solution

There is no doubt that Hadoop is useful when analyzing very large data files. The is no shortage of "Big Data" files in HPC and Hadoop has seen some crossover into some technical computing areas. There is BioPig which extends Apache Pig with a sequence analysis capability. There is also MR-MSPOLYGRAPH, a MapReduce Implementation of a Hybrid Spectral Library-Database Search Method for Large-scale Peptide Identification. In the case of MR-MSPOLYGRAPH, results demonstrated that, relative to the serial version, MR-MSPolygraph reduces the time to solution from weeks to hours, for processing tens of thousands of experimental spectra. There are other applications including Protein sequencing and linear algebra.

Provided your problem fits into the MapReduce framework, Hadoop is powerful way to operate on staggeringly large data sets. Since both the map and reduce step are user defined, highly complex operations can be encapsulated in these steps. Indeed, there is actually no hard requirement for a reducer step if all your work can be done in the map step.

The growth of Hadoop and the hardware on which it runs has been increasing. Certainly it can be seen as a subset of HPC offering a single yet powerful algorithm that has been optimized for a large numbers of commodity servers. There is even some cross-over into technical computing that may see further growth as things like YARN begin to give existing Hadoop clusters more HPC capabilities. Many companies are finding, Hadoop to be the new Corporate HPC for big data.