
From the elephant in the room department

Talk to most people about Apache™ Hadoop® and the conversation will quickly turn to the MapReduce algorithm. MapReduce works quite well as a processing model for many types of problems. In particular, when multiple mapping processes are used to span terabytes of data, the power of a scalable Hadoop cluster becomes evident. In Hadoop version 1, the MapReduce engine was one of two core components; the other was the Hadoop Distributed File System (HDFS). Once data were stored and replicated in HDFS, MapReduce could move computational processes to the servers on which specific data reside. The result is a very fast, parallel approach to problems involving large amounts of data. But MapReduce is not the whole story.
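
To make the MapReduce model concrete, here is a minimal single-machine sketch of its three phases -- map, shuffle, and reduce -- using the classic word-count example. This is an illustration of the programming model only, not Hadoop's actual distributed implementation; in a real cluster each phase runs in parallel across many nodes.

```python
from collections import defaultdict

# Map phase: each mapper emits (word, 1) pairs from its share of the input.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase: group intermediate pairs by key, as Hadoop does
# between the map and reduce stages.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine the grouped values for each key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop moves compute to data", "data locality makes Hadoop fast"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # 2
print(counts["data"])    # 2
```

In a real cluster, the shuffle step is where data crosses the network; the map and reduce steps run on the nodes holding the data, which is what makes the model scale.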

The Hadoop version 1 ecosystem is illustrated in Figure 1 below. The core components, HDFS and MapReduce, are shown in the gray boxes. Other tools and services are built on top of these components. Notice that the only form of distributed processing is the MapReduce component. Part of this design is a JobTracker service that manages and assigns cluster computing resources.

Hadoop without YARN
Figure 1: Hadoop Version 1 Ecosystem

This model works well provided your problem fits cleanly into the MapReduce framework. If it does not, then MapReduce is usually not the most efficient method for your application -- if it works at all. To address this and other limitations, Hadoop version 2 has "demoted" MapReduce to an "application framework" that runs under the new YARN (Yet Another Resource Negotiator) resource manager. Thus, Hadoop version 2 no longer offers a single MapReduce engine, but rather a more general cluster resource manager.

The Hadoop version 2 ecosystem with YARN is shown in Figure 2 below. Note that it is virtually identical to that of Figure 1, except that the MapReduce processing component has been split into YARN and a MapReduce framework. The relationships and functionality of the projects built on Hadoop version 1 MapReduce remain unchanged (e.g., Hive still works on top of MapReduce). Also, the JobTracker of version 1 is now the ResourceManager, which lives fully in the YARN component.

Hadoop with YARN
Figure 2: Hadoop Version 2 Ecosystem

The migration from the monolithic MapReduce engine in Hadoop version 1 to the general-purpose YARN resource management system in Hadoop version 2 is changing the Hadoop landscape. Organizational data stored in HDFS are now amenable to any type of processing, not just MapReduce. Developers are no longer forced to fit solutions into the MapReduce framework and can use other analysis frameworks or create their own.

Limitations of Hadoop Version 1

Hadoop version 1 had some known scalability limitations. Because the monolithic MapReduce engine tracked both resource management and job progress, the maximum cluster size was limited to about 4,000 nodes and the number of concurrent tasks to about 40,000. Another issue was the reliance on a single JobTracker process, which represented a single point of failure: if the JobTracker failed, all queued and running jobs were killed.

In terms of efficiency, the biggest limitation was the hard partitioning of resources into map and reduce slots. There was no way to redistribute these resources so that the cluster could be used more efficiently. Finally, there were many requests for alternate paradigms and services, such as graph processing. Addressing these issues required a major redesign: essentially, the monolithic MapReduce process was broken into two parts, a resource scheduler (YARN) and a generalized processing framework that includes MapReduce.

Hadoop Version 2 Features

In creating the new version, the developers preserved all the MapReduce capability found in version 1. Thus, there is no penalty for upgrading to version 2 because all version 1 MapReduce applications still work. Most applications are binary compatible; at worst, in some rare cases, an application may need to be recompiled.

The YARN scheduler in version 2 also brings a more dynamic approach to resource containers (a resource container is a number of compute cores, usually one, plus an amount of memory). In Hadoop version 1 these containers were either "mapper slots" or "reducer slots." In version 2, they are generalized slots under dynamic control. Thus, most clusters see an immediate boost in efficiency when switching to Hadoop version 2.
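
The efficiency gain from generalized containers can be seen in a toy model. The slot counts and functions below are illustrative assumptions, not Hadoop configuration values; the point is that fixed map/reduce slots can sit idle while work waits, whereas generic containers cannot.

```python
# Toy model: a cluster with 10 total slots. Version 1 hard-partitions
# them into map and reduce slots; version 2 hands out generic containers.
MAP_SLOTS, REDUCE_SLOTS = 6, 4

def v1_runnable(map_tasks, reduce_tasks):
    # Fixed partition: idle reduce slots cannot run waiting map tasks.
    return min(map_tasks, MAP_SLOTS) + min(reduce_tasks, REDUCE_SLOTS)

def v2_runnable(map_tasks, reduce_tasks):
    # Generic containers: any slot can run any task.
    return min(map_tasks + reduce_tasks, MAP_SLOTS + REDUCE_SLOTS)

# A map-heavy phase: 10 map tasks waiting, no reduce tasks yet.
print(v1_runnable(10, 0))  # 6  -- four reduce slots sit idle
print(v2_runnable(10, 0))  # 10 -- every slot is used
```

During the early, map-heavy phase of a job, the version 1 cluster runs at 60% utilization in this model while the version 2 cluster runs at 100%, which is the "immediate boost in efficiency" described above.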

The scalability issue has been resolved through the use of a dedicated resource scheduler -- YARN. The scheduler has no responsibility for running or monitoring job progress. Indeed, the YARN ResourceManager does not care what type of job the user is running; it assigns resources and steps out of the way. This design also allows for a fail-over ResourceManager so that it is no longer a single point of failure.
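
The "assigns resources and steps out of the way" idea can be sketched as a toy allocator. This is not YARN's actual API (real applications negotiate containers through an ApplicationMaster protocol); it only illustrates that the scheduler grants capacity without ever inspecting what kind of job will run in it.

```python
# Toy resource manager: grants (cores, memory) containers on request
# and never looks at what kind of job will run inside them.
class ToyResourceManager:
    def __init__(self, total_cores, total_mem_mb):
        self.free_cores = total_cores
        self.free_mem = total_mem_mb

    def request_container(self, cores=1, mem_mb=1024):
        # Grant if capacity remains; the job type is irrelevant here.
        if cores <= self.free_cores and mem_mb <= self.free_mem:
            self.free_cores -= cores
            self.free_mem -= mem_mb
            return {"cores": cores, "mem_mb": mem_mb}
        return None  # caller waits and asks again later

    def release_container(self, container):
        self.free_cores += container["cores"]
        self.free_mem += container["mem_mb"]

rm = ToyResourceManager(total_cores=4, total_mem_mb=8192)
c1 = rm.request_container(2, 4096)  # granted -- perhaps to a MapReduce job
c2 = rm.request_container(2, 4096)  # granted -- perhaps to a graph job
c3 = rm.request_container(1, 1024)  # denied: the cluster is full
print(c3)  # None
```

Because tracking job progress is left entirely to the applications, the scheduler stays simple enough to scale and to replicate for fail-over.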

Because YARN is a general scheduler, support for non-MapReduce jobs is now available on Hadoop clusters. Applications must request resources from the ResourceManager, but the actual computation can be determined by the user's needs rather than the MapReduce data flow. One such application is Spark, which can be thought of as a memory-resident MapReduce. Typically, Hadoop MapReduce moves computation to the node where the data reside on disk and writes intermediate results to the node's disks. Spark bypasses these disk writes and reads and keeps everything in memory. In other words, Spark moves computation to where the data live in memory, thereby creating very fast and scalable applications.
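
The difference between disk-based and memory-resident intermediate results can be sketched on a single machine. This is a deliberately simplified analogy, not Spark or Hadoop code: both functions compute the same sum of squares, but one spills the intermediate dataset to disk between stages the way MapReduce does, while the other keeps it in memory the way Spark does.

```python
import json
import os
import tempfile

data = list(range(1, 6))

# MapReduce style: write each stage's intermediate result to disk,
# then read it back before the next stage runs.
def disk_pipeline(values):
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    mapped = [v * v for v in values]   # stage 1: map
    with open(path, "w") as f:         # spill intermediates to disk
        json.dump(mapped, f)
    with open(path) as f:              # read them back
        mapped = json.load(f)
    return sum(mapped)                 # stage 2: reduce

# Spark style: the intermediate dataset stays in memory between stages.
def memory_pipeline(values):
    mapped = [v * v for v in values]   # stage 1: map, stays in RAM
    return sum(mapped)                 # stage 2: reduce

print(disk_pipeline(data), memory_pipeline(data))  # 55 55
```

Both produce the same answer; in an iterative workload that repeats these stages many times, skipping the disk round-trip on each iteration is what gives Spark its speed advantage.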

New application frameworks are not limited to MapReduce-like applications. Other frameworks include Apache Giraph for graph processing. There is even support for the Message Passing Interface (MPI) API within a YARN application.

Another aspect of the new Hadoop design is application agility. In version 1, any change to the MapReduce process required an upgrade of the entire cluster; even testing a new MapReduce version required a separate cluster. In version 2, multiple MapReduce versions can run at the same time. This agility also applies to other applications developed to run under YARN: new versions can be tested on the same cluster, and with the same data, as the production version. Finally, YARN offers users the ability to move away from Java, since YARN applications are not required to be written in Java.

And Why You Should Care

The key point about Apache Hadoop YARN is this: its addition to the Hadoop ecosystem offers a flexible and powerful platform for data analysis and growth. Hadoop is no longer a "one-trick pony"; it provides a very robust and open computing environment that can scale into the future.

Learning More

You can learn more about Apache Hadoop YARN from Apache Hadoop YARN LiveLessons (Video Training) and Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2.

Originally published on InformIT.