
From the Princess Bride guide to Hadoop

Apache Hadoop and Spark have received a large amount of attention in recent years. Understanding what Hadoop and Spark bring to the Big Data "revolution" can be difficult because the ecosystem of tools and applications is quite vast, changing, and somewhat convoluted. If one applies Betteridge's Law to the above headline, the answer would certainly be "No" and the mystique of Hadoop, Spark, and Big Data may continue to persist for some time.

Because Cluster Monkey does not like hearing the word "No," we decided to interview our own editor, Douglas Eadline, who has been crushing the notion that Hadoop is difficult and complex by presenting a one-day workshop called Apache Hadoop with Spark in One Day. As background, Eadline has, in addition to writing numerous articles and presentations, authored several books and videos on Hadoop and Big Data; his most recent, the Hadoop 2 Quick-Start Guide, is the genesis for the "Hadoop in One Day" concept that continues to intrigue users, administrators, and managers from all sectors of the market.

By way of full disclosure, when not writing or consulting, Eadline splits his time between serving as Editor of Cluster Monkey and assisting Basement Supercomputing with desk-side High Performance Computing (HPC) and Hadoop computing designs.

Cluster Monkey: First, you are known for your experience in the HPC community. Why all this interest in Hadoop?

Douglas Eadline: At first, I became intrigued with Hadoop because of the scale. In HPC, the challenge of scale is what drives much of the industry (think the Top500 list). Then, out of the corner of my eye, I see these Hadoop clusters getting larger and larger. I start to wonder: how do they deal with the scale issue?

I looked more deeply into the topic and realized the machines were (initially) designed to be large monolithic MapReduce engines. Focusing on a single algorithm and large data sets certainly explained many of the design decisions and why Hadoop clusters look very different from HPC clusters.

Hadoop version 2 changes everything about "Hadoop" and basically moves it from a one-trick pony to a Big Data platform, which is why running "Spark on Hadoop" makes perfect sense, but let's get back to the question.

As I explored Hadoop further, I wanted to know what was so important about the "Hadoop approach to data." I mean, database technology is mature and powerful these days and yet there was a lot of technical buzz around Hadoop. I think I found the answer, by the way.

CM: Okay, what is the answer?

DE: It is all about the size, speed, and variability of online data, which continues to grow quite rapidly. Basically, Hadoop takes a lazy approach and works with raw data, while more traditional approaches, like a data warehouse, perform an extract, transform, and load (ETL) step before data are made available for use. With Hadoop, the ETL step takes place as part of the application, leaving the original data intact and allowing processing to begin immediately or even in real time. This capability is why people get excited about Hadoop.
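[Editor's note: As a rough illustration of that "lazy" (schema-on-read) approach, here is a minimal PySpark sketch. The file name and column names are hypothetical; the point is that the raw file is read as-is and the "transform" of ETL happens inside the application at query time.]

```python
# Minimal schema-on-read sketch (hypothetical file name and columns).
# The raw file stays untouched; the transform happens at read/query time.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Read the raw, unprocessed data (it could live in HDFS); no prior load step.
raw = spark.read.csv("hdfs:///data/raw/web_logs.csv", header=True)

# Apply the "T" of ETL on the fly, leaving the original data intact.
clean = (raw
         .withColumn("ts", to_timestamp(col("timestamp")))
         .filter(col("status") == "200"))

clean.groupBy("url").count().show()
```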

CM: This seems to be the way HPC manages data to some extent. Why not just use HPC for Big Data?

DE: That is a whole other issue. If you really want to understand how Hadoop evolved, take a look at the first chapter of the Hadoop YARN book, "Apache Hadoop YARN: A Brief History and Rationale." HPC users will recognize some commonalities in the early versions of Hadoop and then understand how and why decisions were made to create a full Big Data platform.

CM: This topic sounds rather complicated. How can one possibly learn it all in one day?

DE: So let me be clear: I can't make you an expert in one day. What I can do is provide a focused overview of the Hadoop landscape and get you to the "hello-world.c point" of understanding for some important applications (like Spark).

CM: What is the "hello-world.c point" of understanding?

DE: In computer programming, there are usually small example programs that you learn when using a new programming language. The most famous is the classic C program that prints "hello world." Once a user can perform and understand the steps to get to this point, they can continue to modify the simple program for their needs. I do the same thing with some important Hadoop tools. I can't do everything in one day, so I suggest the Quick-Start Guide (all participants get a copy) as a way to continue learning. My videos also follow a similar format. Plus, there are plenty of resources on the Internet.
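[Editor's note: For Spark, the usual "hello-world.c point" is the classic word count. Below is a minimal PySpark sketch under that assumption; the input path is hypothetical.]

```python
# The traditional "hello world" of Spark: word count (input path is hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///user/demo/input.txt")
          .flatMap(lambda line: line.split())   # split each line into words
          .map(lambda word: (word, 1))          # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b))     # sum the counts per word

for word, count in counts.take(10):
    print(word, count)
```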

CM: You seem to make a big deal about using a local cluster to teach this workshop. Why does that matter?

DE: Well, this feature comes from several directions. First, using the Internet for instruction has always been problematic for me, and I like the Internet! If everything were run from the command line, it would probably be workable, but I like to use the web-based user interfaces, in particular the Zeppelin interface for Spark, which is maturing quite nicely. I really dislike waiting during a presentation for a page to refresh or having some issue that is out of my control become a show stopper.

Second, Hadoop is a cluster application. Even though it can be installed on a single machine, the "distributed" experience is missed. For example, a working multi-node Hadoop Distributed File System (HDFS) cannot be reproduced on a single laptop. Being able to "see" how data are stored and applications run on a real cluster is part of understanding Hadoop.
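[Editor's note: One way to "see" where HDFS places data is sketched below in Python, assuming a configured `hdfs` command-line client; the file and path names are hypothetical. On a multi-node cluster the report shows blocks spread across DataNodes; on a single laptop every block lands on one node.]

```python
# Sketch: inspecting where HDFS places the blocks of a file.
# Assumes the standard `hdfs` CLI is installed and configured;
# the local file and HDFS path are hypothetical.
import subprocess

# Copy a local file into HDFS (it will be split into blocks and replicated).
subprocess.run(
    ["hdfs", "dfs", "-put", "-f", "sample.csv", "/user/demo/sample.csv"],
    check=True)

# Ask the NameNode which blocks make up the file and which DataNodes hold them.
report = subprocess.run(
    ["hdfs", "fsck", "/user/demo/sample.csv", "-files", "-blocks", "-locations"],
    capture_output=True, text=True, check=True)
print(report.stdout)
```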

The final direction was the creation of a Limulus Hadoop desk-side cluster. As you know, Basement Supercomputing has created desk-side HPC appliances that provide high performance within a low power, heat, and noise envelope. It did not take much to create a similar appliance that runs Hadoop, primarily by adding additional storage. These are real four-node clusters that provide a true real-time cluster experience for workshop participants. Everything is local; there is no Internet with which to contend. Users connect their laptops via a local private wireless network.

CM: Why not just stream the workshop on the Internet and let people watch it at their leisure?

DE: That is certainly possible, but I'm a bit old-school when it comes to training. I like small classrooms that support interaction. The Hadoop with Spark in One Day workshop is designed that way. You get a high-touch hands-on Hadoop experience -- take a day, dive in and get what you need quickly and efficiently. And, have some fun at the same time!

CM: So if the workshop participants do not come out as Hadoop experts, what can they expect to gain from the workshop?

DE: Plenty. In my experience, having run this workshop several times, people come away with a solid understanding of the Hadoop ecosystem, which can be quite overwhelming at times. In addition, attendees get essential hands-on experience with some of the key tools. I already mentioned Spark, and I include Pig and Hive as well. Plus, I can point students in the direction of the right Hadoop tools to help solve their problems. We provide all workshop participants with a copy of the Quick-Start book so they can continue learning about things like Flume, Sqoop, HBase, etc. after the workshop. There is also a web-based Q&A board set up for questions and discussion after the workshop. I think one of the most interesting outcomes is a good understanding of what Hadoop is and isn't. It always reminds me of Inigo Montoya in The Princess Bride: "You keep using that word (Hadoop). I do not think it means what you think it means."

CM: What do participants have to say about the workshop? And where is it held?

DE: I have done the workshop in several academic settings around the country and it has been received quite well. There is a good balance of presentation and hands-on experience. The next workshop is in New York City in mid-May 2016, and we will add more dates and locations as interest develops. And, because the whole workshop travels with the desk-side Hadoop cluster, it can be run almost anywhere -- including on-site at any organization.

CM: One final question. People seem to think Hadoop and Spark are two different things and are competitive projects. Is that the case? Will Hadoop get replaced by Spark?

DE: In a word: "No." Recall that I mentioned that Hadoop version 2 is a platform for Big Data and Spark is just one of the many tools that can run as part of the platform. In my opinion, if you are going to do "Big Data" processing, you will need more than a single tool to solve your problem. Hadoop is a management platform for data and offers a large array of capabilities and tools (including commercial applications) that can be used. Of course, the end user will probably use a web UI, such as Zeppelin, and never know (or need to know) all the important components that are running down in the engine room.

CM: I'm starting to think you can learn Hadoop and Spark in one day.

DE: You certainly can. There are plenty of attendees who will agree with you.