Apache Hadoop, Spark, and Big Data

Apache Hadoop is a platform for managing large amounts of data. It includes many tools and applications under one framework.

From the "How to be a Hadoop/Spark smarty pants" department

Say it with me, "Apache Hadoop® is a platform for data lake computing." There is a lot to unpack in that sentence. Specifically, Hadoop used to be a monolithic MapReduce engine. Now it is a starting point for many other types of data analytics projects. Spark is a very popular and powerful processing engine (written in Scala) that has an API for Python, R, and Java. Data lakes are large repositories of raw data that live in the Hadoop Distributed File System (HDFS). Then there is a resource manager called YARN (Yet Another Resource Negotiator) that supports dynamic run-time resource usage and data locality (among other things). Where to begin?

Ever wonder what Edge computing is all about? Data happens and information takes work. Estimates are that by 2020, 1.7 megabytes of new data will be created every second for every person in the world. That is a lot of raw data.
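To put that estimate in perspective, here is a back-of-the-envelope sketch in Python. The 1.7 megabytes per second figure comes from the estimate above; the function names, the use of decimal units, and the population value in the usage lines are assumptions made only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def daily_megabytes_per_person(mb_per_second=1.7):
    """Megabytes created per person per day at the quoted rate."""
    return mb_per_second * SECONDS_PER_DAY


def daily_exabytes(population, mb_per_second=1.7):
    """Total exabytes per day for a given population (hypothetical input).

    Uses decimal units: 1 EB = 10**12 MB.
    """
    return daily_megabytes_per_person(mb_per_second) * population / 10**12


if __name__ == "__main__":
    per_person_gb = daily_megabytes_per_person() / 1000  # decimal GB
    print(f"Per person: about {per_person_gb:.1f} GB per day")
    # The population below is an illustrative value, not a figure from the article.
    print(f"One million people: {daily_exabytes(1_000_000):.5f} EB per day")
```

At the quoted rate, one person accounts for roughly 147 GB of new data per day, which is why the question of where to keep it comes up so quickly.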

Two questions come to mind: what are we going to do with it, and where are we going to keep it? Big Data is often described by the three Vs (Volume, Velocity, and Variety), and note that not all three need apply. What is missing is the letter “U,” which stands for Usability. A Data Scientist will first ask, how much of my data is usable? Data usability can take several forms and include things like quality (is it noisy, incomplete, or inaccurate?) and pertinence (is there any extraneous information that will not make a difference to my analysis?). There is also the issue of timeliness. Is there a “use by” date for the analysis, or might the data be needed in the future for some as yet unknown reason? The usability component is hugely important and often determines the size of any scalable analytics solution. Usable data is not the same as raw data.
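The usability question can be made concrete with a small sketch. The Python below is a minimal illustration, not a method from the article: the record layout, the field names, and the `oldest_allowed` cutoff are all hypothetical. It checks quality (no missing required fields) and timeliness (collected on or after a cutoff date), drops extraneous fields for pertinence, and reports what fraction of the raw records is usable.

```python
from datetime import date


def is_usable(record, required_fields, oldest_allowed):
    """Quality and timeliness check for one raw record (a dict)."""
    # Quality: every required field must be present and not None.
    complete = all(record.get(name) is not None for name in required_fields)
    # Timeliness: the record must have been collected on or after the cutoff.
    collected = record.get("collected")
    timely = collected is not None and collected >= oldest_allowed
    return complete and timely


def keep_pertinent(record, required_fields):
    """Pertinence: drop fields that make no difference to the analysis."""
    return {name: record[name] for name in required_fields}


def usable_fraction(records, required_fields, oldest_allowed):
    """Fraction of raw records that pass the usability checks."""
    if not records:
        return 0.0
    usable = sum(is_usable(r, required_fields, oldest_allowed) for r in records)
    return usable / len(records)


if __name__ == "__main__":
    # Hypothetical sensor readings, invented for this example.
    raw = [
        {"id": 1, "temp": 20.5, "note": "ok", "collected": date(2019, 5, 1)},
        {"id": 2, "temp": None, "note": "bad", "collected": date(2019, 5, 2)},
        {"id": 3, "temp": 19.0, "note": "old", "collected": date(2015, 1, 1)},
        {"id": 4, "temp": 21.0, "note": "ok", "collected": date(2019, 6, 1)},
    ]
    fields = ["id", "temp"]
    cutoff = date(2019, 1, 1)
    print(usable_fraction(raw, fields, cutoff))
    print([keep_pertinent(r, fields) for r in raw if is_usable(r, fields, cutoff)])
```

Sizing a cluster from the usable fraction rather than the raw volume is the point of the paragraph above: in this toy data set only half of the records would need to be carried forward.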

Get the full article at The Next Platform. You may recognize the author.

(or How to Avoid the Trough of Disillusionment)

A recent blog post, Why not so Hadoop?, is worth reading if you are interested in big data analytics, Hadoop, Spark, and all that. The article contains the 2015 Gartner Hype Cycle. The 2016 version is worth examining as well. Some points similar to the blog can be made here:

  1. Big data was at the "Trough of Disillusionment" stage in 2014, but does not appear in the 2015 or 2016 Hype Cycle.
  2. The "Internet of Things" (a technology that is expected to fill the big data pipeline) was on the peak for two years and now has been given "platform status."

From the Princess Bride guide to Hadoop

Apache Hadoop and Spark have received a large amount of attention in recent years. Understanding what Hadoop and Spark bring to the Big Data "revolution" can be difficult because the ecosystem of tools and applications is quite vast, changing, and somewhat convoluted. If one applies Betteridge's Law to the above headline, the answer would certainly be "No" and the mystique of Hadoop, Spark, and Big Data may persist for some time.

Because Cluster Monkey does not like hearing the word "No," we decided to interview our own editor, Douglas Eadline, who has been crushing the notion that Hadoop is difficult and complex by presenting a one-day workshop called Apache Hadoop with Spark in One Day. As background, Eadline has, in addition to writing numerous articles and presentations, authored several books and videos on Hadoop and Big Data. His most recent book, Hadoop 2 Quick-Start Guide, is the genesis of the "Hadoop in One Day" concept that continues to intrigue users, administrators, and managers from all sectors of the market.

By way of full disclosure, when not writing or consulting, Eadline splits his time between serving as Editor of Cluster Monkey and assisting Basement Supercomputing with desk-side High Performance Computing (HPC) and Hadoop computing designs.

From the elephant in the room department

Talk to most people about Apache™ Hadoop® and the conversation will quickly turn to using the MapReduce algorithm. MapReduce works quite well as a processing model for many types of problems. In particular, when multiple mapping processes are used to span TBytes of data, the power of a scalable Hadoop cluster becomes evident. In Hadoop version 1, the MapReduce process was one of two core components. The other component was the Hadoop Distributed File System (HDFS). Once data is stored and replicated in HDFS, the MapReduce process can move computational processes to the server on which specific data resides. The result is a very fast and parallel computational approach to problems with large amounts of data. But MapReduce is not the whole story.
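For readers who have not seen the MapReduce model, the following is a minimal single-process sketch in plain Python using the classic word count example. It is not Hadoop's API and does nothing distributed; the function names are made up for illustration. It only shows the three steps (map, shuffle, reduce) that a Hadoop cluster would spread across the servers holding the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    # On a real cluster each mapper would run next to its block of data;
    # here the mappers simply run one after another.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data in HDFS", "Spark reads data from HDFS"]
    print(word_count(sample))
```

Because each mapper only needs its own lines and each reducer only needs the values for its own keys, both steps can run in parallel, which is what lets the model scale across a cluster.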



Creative Commons License
©2005-2019 Copyright Seagrove LLC, Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.