Apache Hadoop, Spark, and Big Data

Apache Hadoop is a platform for managing large amounts of data. It includes many tools and applications under one framework.

From the "How to be a Hadoop/Spark smarty pants" department

Say it with me: "Apache Hadoop® is a platform for data lake computing." There is a lot to unpack in that sentence. Specifically, Hadoop used to be a monolithic MapReduce engine; now it is the starting point for many other types of data analytics projects. Spark is a very popular and powerful processing engine (written in Scala) that offers APIs for Python, R, and Java. Data lakes are large repositories of raw data that live in the Hadoop Distributed File System (HDFS). Then there is a resource manager called YARN (Yet Another Resource Negotiator) that supports dynamic run-time resource usage and data locality (among other things). Where to begin?
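The MapReduce model that Hadoop started with can be sketched in plain Python. This is only an illustration of the programming model (map, shuffle/sort, reduce) using an invented word-count example; a real Hadoop job distributes these phases across a cluster:

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs, like a Hadoop mapper."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Group values by key, like Hadoop's shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word, like a Hadoop reducer."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop makes toast", "spark makes toast faster"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

The same word count is a one-liner in Spark precisely because Spark's API abstracts these phases away, which is part of why it has become the popular front door to the platform.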

Ever wonder what Edge computing is all about? Data happens; information takes work. Estimates are that by 2020, 1.7 megabytes of new data will be created every second for every person in the world. That is a lot of raw data.
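A back-of-the-envelope calculation shows how quickly that estimate compounds. The world-population figure below is an assumption for illustration; only the 1.7 MB/second rate comes from the estimate above:

```python
# Scale of the "1.7 MB per second for every person" estimate.
MB_PER_SECOND_PER_PERSON = 1.7
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 seconds

# Per-person daily volume: 1.7 MB/s sustained for a full day
per_person_daily_mb = MB_PER_SECOND_PER_PERSON * SECONDS_PER_DAY
per_person_daily_gb = per_person_daily_mb / 1024   # roughly 143 GB/day

# Assumed world population (not part of the original estimate)
world_population = 7.6e9
world_daily_mb = per_person_daily_mb * world_population
```

Even without the population multiplier, roughly 143 GB of raw data per person per day makes the usability question below unavoidable: nobody can analyze, or even store, all of it.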

Two questions come to mind: what are we going to do with it, and where are we going to keep it? Big Data is often described by the three Vs – Volume, Velocity, and Variety – and note that not all three need apply. What is missing is the letter "U," which stands for Usability. A Data Scientist will first ask: how much of my data is usable? Data usability can take several forms, including quality (is the data noisy, incomplete, or inaccurate?) and pertinence (is there extraneous information that will make no difference to the analysis?). There is also the issue of timeliness: is there a "use by" date for the analysis, or might the data be needed in the future for some as-yet-unknown reason? The usability component is hugely important and often determines the size of any scalable analytics solution. Usable data is not the same as raw data.
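Those usability checks can be made concrete with a simple record filter. Everything here – the field names, the sentinel value, the "use by" cutoff – is invented for illustration; the point is only that quality, completeness, and timeliness are testable properties of each raw record:

```python
from datetime import date

# Hypothetical raw records; fields and values are invented for illustration.
raw_records = [
    {"id": 1, "reading": 21.5, "collected": date(2017, 6, 1)},
    {"id": 2, "reading": None, "collected": date(2017, 6, 1)},   # incomplete
    {"id": 3, "reading": -999, "collected": date(2017, 6, 2)},   # noise sentinel
    {"id": 4, "reading": 20.9, "collected": date(2009, 1, 15)},  # stale
]

def is_usable(record, use_by=date(2015, 1, 1)):
    """Apply simple quality and timeliness checks to one record."""
    if record["reading"] is None:        # incomplete: no measurement
        return False
    if record["reading"] == -999:        # known noise/error sentinel
        return False
    if record["collected"] < use_by:     # past its "use by" date
        return False
    return True

usable = [r for r in raw_records if is_usable(r)]
usable_fraction = len(usable) / len(raw_records)
```

In this toy sample only one of four records survives, which is exactly the kind of ratio that determines how big the eventual analytics solution must be.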

Get the full article at The Next Platform. You may recognize the author.

From the "Here comes the cluestick" department

Apache Hadoop has been in the press lately. Some of the coverage has not been positive and often reflects a misunderstanding of how Hadoop relates to data processing. Indeed, we seem to be in the Trough of Disillusionment of the Gartner Technology Hype Cycle. In my opinion, many of these recent "insights" stem from the belief that Hadoop is some kind of toaster. For the record, Hadoop can make great toast, and as it traveled up the hype curve, market exuberance assumed good-tasting toast could do anything. It turns out, these days, people want something to go along with their toast. What happened to the great toast?

Nothing happened to the toast. Hadoop may have started out as a toaster, but now it is quite a bit more: it has evolved into a full kitchen. To understand modern Hadoop technology, one must understand that, just like a kitchen that is designed to prepare food for consumption, Hadoop is designed as a platform to prepare data for analysis and insight.

(or How to Avoid the Trough of Disillusionment)

A recent blog post, Why not so Hadoop?, is worth reading if you are interested in big data analytics, Hadoop, Spark, and all that. The article includes the 2015 Gartner Hype Cycle; the 2016 version is worth examining as well. A few points similar to those in the blog can be made here:

  1. Big Data sat at the "Trough of Disillusionment" stage in 2014 but does not appear in the 2015 or 2016 Hype Cycles.
  2. The "Internet of Things" (a technology expected to fill the Big Data pipeline) sat at the peak for two years and has now been given "platform status."

From the Princess Bride guide to Hadoop

Apache Hadoop and Spark have received a large amount of attention in recent years. Understanding what Hadoop and Spark bring to the Big Data "revolution" can be difficult because the ecosystem of tools and applications is vast, changing, and somewhat convoluted. If one applies Betteridge's Law to the above headline, the answer would certainly be "No," and the mystique of Hadoop, Spark, and Big Data may persist for some time.

Because Cluster Monkey does not like hearing the word "No," we decided to interview our own editor, Douglas Eadline, who has been crushing the notion that Hadoop is difficult and complex by presenting a one-day workshop called Apache Hadoop with Spark in One Day. As background, Eadline has, in addition to writing numerous articles and presentations, authored several books and videos on Hadoop and Big Data. His most recent book, the Hadoop 2 Quick-Start Guide, is the genesis of the "Hadoop in One Day" concept that continues to intrigue users, administrators, and managers from all sectors of the market.

By way of full disclosure: when not writing or consulting, Eadline shares his time between serving as Editor of Cluster Monkey and assisting Basement Supercomputing with desk-side High Performance Computing (HPC) and Hadoop computing designs.




This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC. Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.