Hadoop Is Not a Toaster | Apache Hadoop, Spark, and Big Data

From the "Here comes the cluestick" department

Apache Hadoop has been in the press lately. Some of the content has not been positive and, oftentimes, reflects a misunderstanding of how Hadoop relates to data processing. Indeed, we seem to be in the Trough of Disillusionment in the Technology Hype Cycle. In my opinion, many of these recent "insights" seem to come from the belief that Hadoop is some kind of toaster. For the record, Hadoop can make great toast, and as it traveled up the hype curve, market exuberance suggested that great-tasting toast could do anything. It turns out that, these days, people want something to go along with their toast. What happened to the great toast?

Nothing happened to the toast. Hadoop may have started out as a toaster, but now it is quite a bit more. Hadoop has evolved into a full kitchen. To understand modern Hadoop technology, one must understand that, just like a kitchen that is designed to prepare food for consumption, Hadoop is designed as a platform to prepare data for analysis and insights.

Apache Hadoop Version 1: The Toaster

First, for those who don't know, Hadoop is an open source project (part of the Apache Software Foundation) started at Yahoo that was originally designed to index the web. It uses MapReduce (a powerful SIMD approach to parallel computing). Like many software projects, it has grown and evolved over time. Prior to version 2, Hadoop was a monolithic parallel MapReduce engine. It worked by slicing (striping) large data files across storage nodes using the Hadoop Distributed File System (HDFS). The Map part of MapReduce would run the same mapping function on each of these slices (thus the SIMD designation).
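To make the "same function on every slice" idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in plain Python. It is not Hadoop code: the function names are my own, the round-robin split is only a stand-in for HDFS slices, and on a real cluster the map calls would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def split_into_slices(lines, num_slices):
    """Deal lines round-robin into slices (a stand-in for HDFS slices)."""
    slices = [[] for _ in range(num_slices)]
    for i, line in enumerate(lines):
        slices[i % num_slices].append(line)
    return slices


def map_words(data_slice):
    """Map step: the SAME function is applied to every slice."""
    return [(word.lower(), 1) for line in data_slice for word in line.split()]


def shuffle(mapped_slices):
    """Group the intermediate (key, value) pairs by key."""
    groups = defaultdict(list)
    for pairs in mapped_slices:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, num_slices=3):
    slices = split_into_slices(lines, num_slices)
    # Sequential here; a cluster would run one map task per slice in parallel.
    mapped = [map_words(s) for s in slices]
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    text = ["toast is great", "the kitchen makes toast", "great kitchen"]
    print(word_count(text))
```

The result does not depend on how the lines are dealt into slices, which is what lets the map step scale out across storage nodes.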

There were many tools built on top of MapReduce in version 1. Most notable is Apache Hive, an SQL interface for data in HDFS. Large amounts of data could now be managed and queried with well-known relational database methods. While some other tools can boast of this capability, Hadoop offers an altogether new way of managing data.

A typical data warehouse uses an Extract, Transform, and Load (ETL) step when adding data to the database. This is schema on write: the database tables must be known before the data are written. Hadoop works differently because it makes no assumptions about the data when they are stored. Each Hadoop application usually consists of several component steps, one of which is usually some form of ETL. In some cases, the "non-relatable" data may not even be amenable to a traditional ETL step (e.g., Twitter streams). A Hadoop application may have an ETL step that uses a model to classify data that can then be placed into Hive tables (e.g., is a tweet about a product "good, neutral, or bad"). Without going too far out into the weeds, the "Hadoop approach" (schema on read) provides a more scalable and flexible approach to data analysis.
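A small sketch may help show what schema on read looks like in practice. The Python below is purely illustrative: the sample records, the keyword lists, and the function names are all made up for this example, and the keyword "model" is a toy stand-in for whatever classifier a real application would use. The point is only that raw records are stored untouched and a structure is imposed when they are read.

```python
import json

# Raw records are stored exactly as they arrive; no table layout is imposed.
RAW_STORE = [
    '{"user": "ann", "text": "I love this toaster"}',
    '{"user": "bob", "text": "The toaster is terrible", "lang": "en"}',
    '{"id": 17, "text": "Bought a toaster today"}',
    'not even json',
]

# Toy keyword lists standing in for a real classification model (assumption).
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"terrible", "bad", "hate"}


def classify(text):
    """Label a tweet as good, bad, or neutral using simple keyword matching."""
    words = set(text.lower().split())
    if words & POSITIVE:
        return "good"
    if words & NEGATIVE:
        return "bad"
    return "neutral"


def read_with_schema(raw_lines):
    """Apply the schema at read time and skip records that do not fit it."""
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable record: left in the store, ignored on read
        if not isinstance(record, dict) or "text" not in record:
            continue
        rows.append({
            "user": record.get("user"),  # missing fields become None
            "text": record["text"],
            "sentiment": classify(record["text"]),
        })
    return rows


if __name__ == "__main__":
    for row in read_with_schema(RAW_STORE):
        print(row)
```

The rows that come out have a fixed shape and could be loaded into a table, while the raw store keeps everything, including the records this particular schema ignores.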

Apache Hadoop Version 2: The Kitchen

The advent of version 2 broke the monolithic Hadoop engine into two pieces:

A resource scheduler (called YARN, Yet Another Resource Negotiator)
An independent MapReduce engine.

Together these components provided the same capabilities as version 1 but offered much more flexibility. MapReduce still operated on the HDFS slices, but YARN was now an independent scheduler designed for dynamic workloads. Most resource managers are designed to manage a fixed amount of resources per job, while YARN can manage dynamic resource loads within a job. YARN also provides "data locality" as a resource (in addition to things like cores, memory, GPUs, or software licenses).
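To illustrate what treating "data locality" as a resource might mean, here is a toy placement function in Python. This is not YARN's scheduling algorithm or API. The function name, the node names, and the slot-counting model are assumptions made only for illustration. It simply prefers a node that already holds the data and falls back to any free node otherwise.

```python
def place_task(block_locations, free_slots, block_id):
    """Pick a node for a task, preferring one that already holds the data.

    block_locations: dict mapping block id -> list of nodes holding a copy
    free_slots: dict mapping node name -> free task slots (updated in place)
    Returns (node, is_local), or (None, False) when no slot is free.
    """
    # First choice: a node with a local copy of the data and a free slot.
    for node in block_locations.get(block_id, []):
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node, True
    # Fallback: any node with a free slot; the data must travel to it.
    for node in sorted(free_slots):
        if free_slots[node] > 0:
            free_slots[node] -= 1
            return node, False
    return None, False


if __name__ == "__main__":
    # Hypothetical cluster layout, used only for this example.
    locations = {"blk_1": ["node-a", "node-b"], "blk_2": ["node-a"]}
    slots = {"node-a": 1, "node-b": 1, "node-c": 1}
    for block in ["blk_1", "blk_2", "blk_1", "blk_2"]:
        print(block, place_task(locations, slots, block))
```

In this sketch locality is a preference rather than a hard requirement, so a task still runs somewhere when its preferred nodes are busy.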

The independence of YARN allows it to be used by other tools. For instance, on many Hadoop clusters, Apache Spark execution is managed by YARN. Even though Spark does "in memory" processing, the data still reside in HDFS, which is designed as a distributed streaming "write-once read-many" file system. There is also an optimized MapReduce engine called "Tez" that significantly accelerates tools like Hive. There are many other tools that use YARN to run applications on a Hadoop cluster.

Sitting above all these tools are many other applications that work to manage a "data pipeline." A very popular tool is Apache Kafka, which provides the capability to build real-time streaming data pipelines that reliably move data between systems or applications. In addition, data governance tools are available and are critical to many production environments.

There are plenty more applications and tools at various levels. The point is clear, however: Hadoop is not a single isolated tool like a toaster. It is now a platform or a kitchen on/in which to create things. Your meal may include toast, but there are many more options available to today's chefs (who may actually moonlight as data engineers and scientists). And, if you keep thinking that Hadoop is a toaster, all you will ever get is toast.

The toaster-to-kitchen analogy is not unique to Hadoop. In a similar vein, the Linux operating system went through a hype cycle. There was a time when a grey beard (with optional ponytail) and the ability to boot Linux would get you an audience with a venture capitalist. There were all kinds of Linux shows, conferences, and publications (I used to write for one). Linux grew mature and turned into a kitchen. Very few write about or hype the "Linux toaster" anymore (except the chef/developer types). The meals are what matter now.

What About the Doom and Gloom?

And what about the coalescing of Hadoop companies? Do yourself a favor and separate the "funding hype and speculation" from the technical achievements. Again, back in the hype days of Linux, there were many companies that had tens of millions invested in them because they had a new Linux toaster. Many of these companies don't exist anymore. We still have the Linux kitchen, however. Thanks to open source, a strong community, and companies that know how to make good kitchens, we can get on with our recipes. Hadoop offers the same capabilities.

And next time you hear "X technology is dead," remember that such binary (black and white) statements are oversimplifications from people who are still looking for toast in a kitchen that is full of other delicious food (not that there is anything wrong with toast).

Learning More About the Kitchen

For those who want to learn more about Hadoop, Spark, and Data Science, I invite you to try one of the online courses that I teach. They are short, to the point, and provide you with a quick start on the Hadoop/Spark ecosystem. There is also a Cluster Monkey article that presents some additional options for learning more about Hadoop. Of course, the web is replete with other resources.

In closing, I also highly recommend reading Hadoop is Dead. Long Live Hadoop by Arun Murthy. As one of the principal Hadoop developers, Arun provides important insight into Hadoop and the philosophy of the Hadoop ecosystem (kitchen). Bon Appétit.