Live On-Line Training:
Scalable Data Pipelines with Hadoop, Spark, and Kafka

Welcome to Scalable Analytics with Apache Hadoop and Spark

(The four essential courses on the path to scalable data science nirvana, or at least a good start)

Course Descriptions and Links

Click on a course name for availability and further information. For best results, take the courses in the recommended order shown below. Courses 1 and 2 can be taken in either order. Course 3 builds on courses 1 and 2. Course 4 builds on and assumes competence with the topics in courses 1, 2, and 3.

1	Apache Hadoop, Spark and Big Data Foundations - A great introduction to the Hadoop Big Data Ecosystem. A non-programming introduction to Hadoop, Spark, HDFS, and MapReduce. (3 hours-1 day)
2	Practical Linux Command Line for Data Engineers and Analysts - Quickly learn the essentials of using the Linux command line on Hadoop/Spark clusters. Move files, run applications, write scripts and navigate the Linux command line interface used on almost all modern analytics clusters. Students can download and run examples on the “Linux Hadoop Minimal” virtual machine (see below). (3 hours-1 day)
3	Hands-on Introduction to Apache Hadoop and Spark Programming - A hands-on introduction to using Hadoop, Pig, Hive, Sqoop, Spark and Zeppelin notebooks. Students can download and run examples on the “Linux Hadoop Minimal” virtual machine (see below). (6 hours-2 days)
4	Scalable Data Science with Hadoop and Spark - Learn how to apply Hadoop and Spark tools to predict airline delays. All programming will be done using Hadoop and Spark with the Zeppelin web notebook on a four-node cluster. The notebook will be made available for download so students can reproduce the examples. (3 hours-1 day)

Class Notes for Hands-on Introduction to Apache Hadoop and Spark Programming

(Updated 03-June-2019)

Class Notes (tgz format)
Class Notes (zip format)

Class Notes for Practical Linux Command Line for Data Engineers and Analysts

(Updated 19-Mar-2019)

Class Notes (tgz format)
Class Notes (zip format)

Zeppelin Notebook for Scalable Data Science with Hadoop and Spark

(Updated 19-Aug-2019)

Scalable-Analytics.json

DOS to Linux and Hadoop HDFS Help:

Linux Hadoop Minimal (LHM) Virtual Machine Sandbox

(Current Version 0.42, 03-June-2019) Not yet ready for the Scalable Data Science with Hadoop and Spark course (coming soon)

Used for Hands-on, Command Line, and Scalable Data Science courses above. Note: This VM can also be used for the Hadoop and Spark Fundamentals: LiveLessons video mentioned below.

Linux Hadoop Minimal Installation Instructions (Read First)
Linux Hadoop Minimal MD5
Linux Hadoop Minimal Virtual Machine OVA file: US, Europe (3.3G)
Old Versions
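
Since an MD5 file is published alongside the 3.3G OVA download, it can be worth checking the downloaded file against it before importing the virtual machine. The following is a minimal Python sketch of one way to do that, not the procedure from the installation instructions. The function names, and the assumption that the MD5 file holds the hex digest as its first whitespace-separated token, are illustrative and not taken from this page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so a multi-gigabyte OVA does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_line):
    """Compare a file against a published checksum line.

    Assumption: the digest is the first whitespace-separated token,
    as in the common "<digest>  <filename>" layout.
    """
    published_hex = published_line.split()[0].lower()
    return md5_of_file(path) == published_hex
```

To use it, pass the local path of the downloaded OVA file and the text of the published MD5 file to `matches_published_md5`. A result of `False` usually means the download is incomplete or corrupted and should be repeated.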

Cloudera-Hortonworks HDP Sandbox

The Cloudera-Hortonworks HDP Sandbox is a full-featured Hadoop/Spark virtual machine that runs under Docker, VirtualBox, or VMWare. Please see Cloudera/Hortonworks HDP Sandbox for more information. Because of the number of applications it includes, the HDP Sandbox can require substantial resources to run.

Zeppelin Web Notebook

For those taking the Scalable Data Science course, a 30-day web-based Zeppelin Notebook is available from Basement Supercomputing. Please use the Sign Up Form to get access to the notebook.

Other Resources for all Classes

Contact

For further questions or help with the Linux Hadoop Minimal Virtual Machine, please email d...@b...g.com
