Moving beyond MapReduce, if that is your cup of tea

Hadoop distribution provider Hortonworks recently announced it was Extending Spark on YARN for Enterprise Hadoop. If you are not up to speed on the Hadoop world, a few points in this announcement are of interest for HPC. First, while Hadoop version 1 was synonymous with MapReduce, Hadoop version 2 has "demoted" MapReduce to an application framework that runs under the YARN (Yet Another Resource Negotiator) resource manager. Thus, Hadoop version 2 has opened up a Hadoop cluster to many applications other than MapReduce.

One such application is Apache Spark, which can be thought of as a memory-resident MapReduce. Typically, Hadoop MapReduce minimizes data transfer by moving computation to the nodes where the data resides (on disk). It also writes intermediate results to the local node disk. Spark bypasses these disk operations and keeps intermediate results in memory. In other words, Spark moves computation to where the data lives in memory, which makes for very fast and scalable applications. This computation model is quite a bit different from those used in the typical HPC cluster. While Spark is still a "MapReduce"-like tool, the ability of a Hadoop cluster to support many different processing models makes it an interesting tool for large-scale data analysis. The in-memory model offered by Spark may be of interest to HPC users who found that traditional MapReduce was too slow for their needs. Hadoop is no longer a "one trick pony."
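
To make the disk-versus-memory distinction concrete, here is a toy word count in plain Python. It is only a sketch of the idea and does not use the Hadoop or Spark APIs; the function names (`map_phase`, `reduce_phase`, `disk_style`, `memory_style`) and the sample input are made up for illustration. One path writes the map output to local files before reducing, roughly as classic MapReduce does, and the other hands the map output straight to the reduce step in memory.

```python
import json
import os
import tempfile
from collections import Counter
from functools import reduce


def map_phase(lines):
    """Map step: count the words in each input split (here, one line per split)."""
    return [Counter(line.lower().split()) for line in lines]


def reduce_phase(partials):
    """Reduce step: merge the per-split counts into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())


def disk_style(lines, workdir):
    """MapReduce-like path: write map output to local disk, then read it back."""
    paths = []
    for i, partial in enumerate(map_phase(lines)):
        path = os.path.join(workdir, f"part-{i}.json")
        with open(path, "w") as f:
            json.dump(partial, f)
        paths.append(path)
    partials = []
    for path in paths:
        with open(path) as f:
            partials.append(Counter(json.load(f)))
    return reduce_phase(partials)


def memory_style(lines):
    """Spark-like path: pass map output to the reduce step without touching disk."""
    return reduce_phase(map_phase(lines))


# Hypothetical sample input, two "splits"
lines = ["spark keeps data in memory", "hadoop writes data to disk"]
with tempfile.TemporaryDirectory() as workdir:
    on_disk = disk_style(lines, workdir)
in_memory = memory_style(lines)

print(on_disk == in_memory, in_memory["data"])  # prints: True 2
```

Both paths produce the same counts; the difference is the round trip through the file system, which is the cost that an in-memory model is meant to avoid when a job chains many such steps.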