Apache Hadoop and Big Data

Apache Hadoop is a platform for managing large amounts of data. It includes many tools and applications under one framework.

From the "How to be a Hadoop/Spark smarty pants" department

Say it with me: "Apache Hadoop® is a platform for data lake computing." There is a lot to unpack in that sentence. Specifically, Hadoop used to be a monolithic MapReduce engine. Now it is a starting point for many other types of data analytics projects. Spark is a very popular and powerful data processing engine (written in Scala) that provides APIs for Python, R, and Java. Data lakes are large repositories of raw data that live in the Hadoop Distributed File System (HDFS). Then there is a resource manager called YARN (Yet Another Resource Negotiator) that supports dynamic run-time resource usage and data locality (among other things). Where to begin?
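
To make that unpacking concrete, the sketch below uses Spark's Python API to ask YARN for resources and read raw text directly out of an HDFS data lake. It is a minimal sketch, not a tuned job: the HDFS path and application name are made up, and a real client machine would need the cluster's Hadoop configuration available (e.g., via HADOOP_CONF_DIR) for the "yarn" master setting to resolve.

```python
# Minimal PySpark sketch: a YARN-scheduled job reading raw data from HDFS.
# Assumes pyspark is installed and that hdfs:///data/weblogs.txt exists
# (the path is hypothetical).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("data-lake-sketch")   # hypothetical application name
         .master("yarn")                # let YARN negotiate the resources
         .getOrCreate())

# Raw data in a lake needs no schema up front; read it as plain text.
lines = spark.read.text("hdfs:///data/weblogs.txt")
print("line count:", lines.count())    # a trivial action to trigger the job

spark.stop()
```

Note that the same script runs unchanged with .master("local[*]") on a laptop, which is part of Spark's appeal as a starting point for new projects.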

(or How to Avoid the Trough of Disillusionment)

A recent blog post, Why not so Hadoop?, is worth reading if you are interested in big data analytics, Hadoop, Spark, and all that. The article includes the 2015 Gartner Hype Cycle, and the 2016 version is worth examining as well. A few points similar to those in the blog post can be made here:

  1. Big data sat at the "Trough of Disillusionment" stage in 2014, but does not appear at all in the 2015/16 Hype Cycles.
  2. The "Internet of Things" (a technology expected to fill the big data pipeline) sat on the peak for two years and has now been given "platform status."

From the Princess Bride guide to Hadoop

Apache Hadoop and Spark have received a large amount of attention in recent years. Understanding what Hadoop and Spark bring to the Big Data "revolution" can be difficult because the ecosystem of tools and applications is vast, constantly changing, and somewhat convoluted. If one applies Betteridge's Law to the above headline, the answer would certainly be "No," and the mystique of Hadoop, Spark, and Big Data may persist for some time.

Because Cluster Monkey does not like hearing the word "No," we decided to interview our own editor, Douglas Eadline, who has been crushing the notion that Hadoop is difficult and complex by presenting a one-day workshop called Apache Hadoop with Spark in One Day. As background, Eadline has, in addition to writing numerous articles and presentations, authored several books and videos on Hadoop and Big Data; his most recent, the Hadoop 2 Quick-Start Guide, is the genesis for the "Hadoop in One Day" concept that continues to intrigue users, administrators, and managers from all sectors of the market.

By way of full disclosure, when not writing or consulting, Eadline divides his time between serving as Editor of Cluster Monkey and assisting Basement Supercomputing with desk-side High Performance Computing (HPC) and Hadoop computing designs.

From the elephant in the room department

Talk to most people about Apache™ Hadoop® and the conversation will quickly turn to the MapReduce algorithm. MapReduce works quite well as a processing model for many types of problems. In particular, when multiple mapping processes are used to span TBytes of data, the power of a scalable Hadoop cluster becomes evident. In Hadoop version 1, the MapReduce engine was one of two core components; the other was the Hadoop Distributed File System (HDFS). Once data are stored and replicated in HDFS, MapReduce can move the computation to the servers where specific blocks of data reside. The result is a very fast and parallel computational approach to problems with large amounts of data. But MapReduce is not the whole story.
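
To see why moving computation to the data is so natural in this model, consider the classic word-count example. The sketch below uses Hadoop Streaming, which lets plain executables act as the map and reduce stages; the script names and HDFS paths here are hypothetical, and the exact location of the streaming jar varies by installation.

```python
#!/usr/bin/env python3
# mapper.py -- runs on the nodes holding each HDFS block;
# emits a (word, 1) pair for every word it sees on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop sorts map output by key, so all counts
# for a given word arrive together and can simply be summed.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

A typical (hypothetical) invocation ships both scripts to the cluster and lets the framework schedule the mappers next to the data:

```
hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
    -input /user/demo/input -output /user/demo/output \
    -mapper mapper.py -reducer reducer.py
```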
