From the "How to be a Hadoop/Spark smarty pants" department

Say it with me, "Apache Hadoop® is a platform for data lake computing." There is a lot to unpack in that sentence. Specifically, Hadoop used to be a monolithic MapReduce engine. Now it is the starting point for many other types of data analytics projects. Apache Spark is a very popular and powerful data processing engine (written in Scala) that has APIs for Python, R, and Java. Data lakes are large repositories of raw data that live in the Hadoop Distributed File System (HDFS). Then there is a resource manager called YARN (Yet Another Resource Negotiator) that supports dynamic run-time resource usage and data locality (among other things). Where to begin?
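To make "monolithic MapReduce engine" concrete, here is a minimal sketch of the MapReduce model in plain Python (no Hadoop required; the function names are mine for illustration, not part of any Hadoop API). It shows the two phases Hadoop runs at scale across a cluster: a map step that emits key/value pairs, and a reduce step that groups by key and combines the values.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/Reduce: group the pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Hadoop is a platform", "Spark runs on Hadoop"]
print(reduce_phase(map_phase(lines)))
```

The classic Hadoop word count example is exactly this logic, with the map and reduce phases distributed over HDFS blocks on many nodes.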

The Hadoop/Spark/Other ecosystem has a lot of moving parts that can be confusing. Having taught many workshops, written books, and created video tutorials on this topic, I can list the most common questions that I get asked:

  • What is the big picture? How does all this fit together?
  • What tools are available? And which one should I use? Do I need to know Java?
  • How do I try Hadoop/Spark/Other tools? I don't have a cluster, and I certainly don't want to install all the software.
  • On the other hand, how do I install Hadoop/Spark/Other software on my laptop?
  • How do I get data from my RDBMS into HDFS? Can I move it back?
  • How does X compare with Y? And what about Z?
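To give a flavor of the RDBMS question: tools like Apache Sqoop pull a database table into HDFS as delimited text files. The first half of that trip, dumping a table to a delimited file, can be sketched in plain Python using sqlite3 as a stand-in for a real RDBMS (the table and file names here are made up for illustration; Sqoop itself talks to the database over JDBC):

```python
import csv
import sqlite3

# Stand-in RDBMS: an in-memory SQLite table (a real source would be
# MySQL, PostgreSQL, etc., reached over JDBC by a tool like Sqoop)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (id INTEGER, carrier TEXT, delay INTEGER)")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)",
                 [(1, "AA", 12), (2, "UA", 0), (3, "DL", 45)])

# Dump the table to a comma-delimited file -- the same kind of file
# that ends up in HDFS (e.g., via 'hdfs dfs -put flights.csv /data/')
with open("flights.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in conn.execute("SELECT id, carrier, delay FROM flights"):
        writer.writerow(row)
```

Sqoop automates this extraction, parallelizes it across mappers, and can run the trip in reverse (HDFS back into database tables) with its export mode.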

The list could continue for quite a bit. These questions can be answered without becoming a Java or Scala programmer or reading a 400-page book on Hadoop and/or Spark. Instead, you may want to consider one or all of the following resources to quickly get started learning about Hadoop/Spark/Others. Over the years, these resources have been developed and refined by my co-authors and me. There are, of course, other on-line resources for learning about these topics. Note that the following resources are under constant refinement (i.e., not aging web tutorials) and provide additional tested content to help you get started. And, where possible, the resources are freely available.

In my approach, there is one important principle that guides each topic. I call it the hello.c method. That is, each topic starts from scratch and demonstrates how to manage the create/run/change loop for each tool. The examples are more robust than a simple hello.c application and offer a solid basis to expand your understanding (or even start building your own application). In addition, there is no need to hunt for resources because everything is included in the notes files.

The following are various pathways to learn about data analytics with Hadoop, Spark, and other tools. Each path can be traversed on its own and reinforced with the other resources. One important feature of all the resources is the annotated notes files that provide all the commands used in each resource. These files allow you to easily cut-and-paste and run the examples from the books, videos, and workshops mentioned below. Finally, I want to call out the Linux Hadoop Minimal virtual machine (LHM-VM) described below. This virtual machine can be run on most notebooks without overwhelming the system. It is a great way to try Hadoop, HDFS, Spark, Pig, Hive, Sqoop, Flume, and the Zeppelin web-GUI. (And, if you are confused by all those names, you are in the right place!)

For Book Learners

Self-Paced Instructional Video

The recently updated Hadoop and Spark Fundamentals LiveLessons from Addison Wesley provides over 14 hours of training video. Major topics include Hadoop, HDFS, Spark, Pig, Hive, Sqoop, Flume, Oozie, HBase, and the Zeppelin web-GUI. If your organization has an account, the videos are available on Safari. Like the books, all examples are described in the code and notes files.

On-line Workshops

The latest addition to the learn-Hadoop-and-Spark arsenal is a set of four live workshops. These are generally held every six weeks and offer a chance to ask me questions (on average, about 25 questions get answered per class). They are fast-paced and designed to get you started quickly with little or no previous experience. All the programming classes include downloadable class notes (with examples) so students can try and build on class examples. Courses 1 and 2 can be taken out of order. Course 3 builds on courses 1 and 2. Course 4 builds on and assumes competence with the topics in courses 1, 2, and 3. (There is now a central web page for all the classes with class notes and supporting material.)

  1. Apache Hadoop, Spark and Big Data Foundations - A great introduction and background on the Hadoop/Spark Big Data ecosystem. This non-programming introduction to Hadoop, Spark, HDFS, and MapReduce is great for both managers and programmers. It is highly recommended to take this course before the "Hands-on Introduction" below. (3 hours - 1 day)
  2. Practical Linux Command Line for Data Engineers and Analysts - Confused by the Linux command line, but need to pull a file from the web into HDFS? This class provides all the background you need to get started with using the Linux command line for data analytics projects. Includes background on the LHM-VM (see below). (3 hours - 1 day)
  3. Hands-on Introduction to Apache Hadoop and Spark Programming - A hands-on introduction to using Hadoop, Pig, Hive, Sqoop, Spark, and Zeppelin notebooks. The supporting files (including a DOS to Linux/HDFS Cheat-sheet) are available from this page. Includes a background on the LHM-VM (see below). (6 hours - 2 days)
  4. Scalable Data Science with Apache Hadoop and Spark - The final course in the series, which walks through a real data science project (predicting airline delays). The entire project has been placed in a Zeppelin web notebook and is available for download. This course brings together aspects of the other courses and relies on previously introduced material. (3 hours - 1 day)

LHM-VM: Hadoop/Spark/Other on Your Laptop

As mentioned above, one of the big "blockers" to learning data analytics with Hadoop and Spark is finding a resource to try examples and experiment. In the past, I have recommended the Hortonworks Hadoop Sandbox virtual machine. This resource provides an excellent full-featured Hadoop/Spark installation based on the Hortonworks HDP release (Hortonworks is an open source Hadoop support/packaging company that has combined with Cloudera). The sandbox can be run as a virtual machine on a laptop or desktop. Unfortunately, running the sandbox requires a rather hefty laptop and can be very slow on older systems.

For this reason, the Linux Hadoop Minimal virtual machine (LHM-VM) was created so that even "small" laptops (4GB memory, two cores, 70GB disk space) could run basic Hadoop and Spark examples. The virtual machine runs a full version of Linux (CentOS 6) with many of the tools needed to run the examples in the above resources. In particular, it will run everything that is covered in the "Hands-on" live course. You can download the LHM-VM from this page and learn more about it from the Installation Notes. To run the image you will need VirtualBox (recommended and tested on Linux, Mac Sierra, and Windows 10); you can also try VMware (not fully tested).

Next Steps

With the help of the above resources, there is no reason not to learn Hadoop, Spark, and the rest of the ecosystem. As the amount of on-line data continues to grow, scalable analytics are becoming increasingly important and will need competent users, programmers, and practitioners. I'm more than happy to help you get started (quickly).

[Updated 09/13/2019 to fix expired links and add a new course]





This work is licensed under CC BY-NC-SA 4.0

©2005-2023 Copyright Seagrove LLC. Some rights reserved. Except where otherwise noted, this site is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. The Cluster Monkey Logo and Monkey Character are Trademarks of Seagrove LLC.