Live On-Line Training:
Scalable Data Pipelines with Hadoop, Spark, and Kafka

Welcome to Scalable Analytics with Apache Hadoop and Spark

(The four essential courses on the path to scalable data science nirvana, or at least a good start)

Course Descriptions and Links

Click on the course name for availability and further information. For best results, take the courses in the recommended order shown below. Courses 1 and 2 can be taken in either order. Course 3 builds on courses 1 and 2. Course 4 builds on courses 1, 2, and 3.

1 Apache Hadoop, Spark and Big Data Foundations - A non-programming introduction to the Hadoop Big Data ecosystem, covering Hadoop, Spark, HDFS, and MapReduce. (3 hours - 1 day)

2 Practical Linux Command Line for Data Engineers and Analysts - Quickly learn the essentials of using the Linux command line on Hadoop/Spark clusters. Move files, run applications, write scripts, and navigate the Linux command line interface used on almost all modern analytics clusters. (3 hours - 1 day)

3 Hands-on Introduction to Apache Hadoop and Spark Programming - A hands-on introduction to using Hadoop, Pig, Hive, Sqoop, Spark, and Zeppelin notebooks. Students can download and run the examples on a “Hadoop Minimal” virtual machine. (6 hours - 2 days)

4 Scalable Data Science with Hadoop and Spark - Learn how to apply Hadoop and Spark tools to predict airline delays. All programming will be done using Hadoop and Spark with the Zeppelin web notebook on a four-node cluster. The notebook will be made available for download so students can reproduce the examples. (3 hours - 1 day)

Class Notes for Hands-on Introduction to Apache Hadoop and Spark Programming

(Updated 03-June-2019)

Class Notes (tgz format)
Class Notes (zip format)

Class Notes for Practical Linux Command Line for Data Engineers and Analysts

(Updated 19-Mar-2019)

Class Notes (tgz format)
Class Notes (zip format)

DOS to Linux and Hadoop HDFS Help:

Linux Hadoop Minimal Virtual Machine Current Version 0.42

(Updated 03-June-2019) Note: This VM can also be used for the Hadoop and Spark Fundamentals: LiveLessons video mentioned below.

Linux Hadoop Minimal Installation Instructions (Read First)
Linux Hadoop Minimal MD5
Linux Hadoop Minimal Virtual Machine OVA file (3.3G in size)
old versions
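
The MD5 file listed above can be used to check that the OVA download arrived intact. The following is a minimal Python sketch, not part of the course materials. It assumes the OVA has been saved locally and that the published checksum is in the usual `md5sum` layout (hex digest first, optionally followed by a filename). The function names are illustrative only.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in chunks so a multi-gigabyte OVA does not
    have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_line):
    """Compare a local file against a published checksum line.

    Assumption: `published_line` starts with the hex digest,
    as in md5sum output ("<digest>  <filename>").
    """
    published_hex = published_line.split()[0].lower()
    return md5_of_file(path) == published_hex
```

To use it, pass the path of the downloaded OVA and the contents of the MD5 file to `matches_published_md5`; a result of `False` suggests the download should be repeated.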

<H3>Other Resources for <I>Hands-on Introduction to Apache Hadoop and Spark Programming</I>:</H3>
<UL>
<LI>Book: <a href="https://www.clustermonkey.net/Hadoop2-Quick-Start-Guide/">Hadoop® 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop® 2 Ecosystem</a></LI>
<LI>Video Tutorial: <a href="https://www.safaribooksonline.com/library/view/hadoop-and-spark/9780134770871">Hadoop and Spark Fundamentals: LiveLessons</a></LI>
<LI>Book: <a href="https://www.clustermonkey.net/Practical-Data-Science-with-Hadoop-and-Spark/">Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale</a></LI>
</UL>
<H3>Contact</H3>
<UL>
<LI>For further questions or help with the Linux Hadoop Minimal Virtual Machine please email <a href="http://scr.im/4502">d…@b…g.com</a></LI>
</UL>
<HR>
<P>Unless otherwise noted, all course content, notes, and examples © Copyright Basement Supercomputing 2019, All rights reserved.</P>
