Live On-Line Training:
Scalable Data Pipelines with Hadoop, Spark, and Kafka

Welcome to the Effective Data Pipelines Series

This page provides many of the resources for books, videos and on-line trainings.

You can find more information on all current video and book titles and upcoming on-line trainings from O'Reilly.

Training Descriptions

Many of the trainings are run on a regular basis. Check the O'Reilly site for upcoming live events. Where possible, trainings are demonstrated using the freely available virtual machine. To facilitate continued exploration using the virtual machine, training notes (text files) are available below.

Apache Hadoop, Spark, and Kafka Foundations - (POPULAR) A great introduction to the Hadoop Big Data Ecosystem with Spark and Kafka. A non-programming introduction to Hadoop, Spark, HDFS, MapReduce, and Kafka. After completing the workshop, attendees will have a workable understanding of the Hadoop/Spark/Kafka technical value proposition and a solid background for the following trainings in the Effective Data Pipelines Series. (3 hours-1 day)
Bash Programming Quick-start for Data Science - Quickly learn the essentials of using the Linux command line for Data Science at scale. Download/upload files, run applications, monitor resources, edit files, write scripts, and navigate the Linux command line interface used on almost all modern analytics clusters. Students can download and run examples on the “Linux Hadoop Minimal” virtual machine, see below. (4 hours-1 day)
Hands-on Introduction to Apache Hadoop, Spark, and Kafka Programming - (POPULAR) A hands-on introduction to using Hadoop, Hive, Sqoop, Spark, Kafka and Zeppelin notebooks. Students can download and run examples on the “Linux Hadoop Minimal” virtual machine, see below. (6 hours-2 days)
Getting Started with Kafka - (POPULAR) Apache Kafka is designed to manage data flow by decoupling the data source from the destination. Kafka can provide a robust data buffer or broker that can help create and manage data pipelines. In this training, the basic Kafka data broker design and operation are explained and illustrated using both the command line and a GUI.
Kafka Methods and Administration - Additional Kafka features that go beyond those presented in Getting Started with Kafka will be addressed. These topics include writing to databases and HDFS, producer and consumer options, working with Kafka Connect, and Kafka installation and administration.
Up and Running with Kubernetes - Kubernetes can be considered a container operating system where application resource and storage needs are matched to an underlying cluster environment (either virtual or real). This course provides background for users along with a practical hands-on introduction to Kubernetes.
Data Engineering at Scale with Apache Hadoop and Spark - As part of the Effective Data Pipelines series, this training provides background and examples on data “munging” or transforming raw data into a form that can be used with analytical modeling libraries. Also referred to as data wrangling, transformation, or ETL, these techniques are often performed “at scale” on a real cluster using Hadoop and Spark. (3 hours-1 day)
Scalable Analytics with Apache Hadoop, Spark, and Kafka - A complete data science investigation requires different tools and strategies. In this training, learn how to apply Hadoop, Spark, and Kafka tools to predict airline delays. All programming will be done using Hadoop, Spark, and Kafka with the Zeppelin web notebook on a four-node cluster. The notebook will be made available for download so students can reproduce the examples. (3 hours-1 day)

About the Presenter

Douglas Eadline began his career as an analytical chemist with an interest in computer methods. Starting with the first Linux Cluster Beowulf How-to document, Doug has written instructional documents covering many aspects of Linux High Performance Computing (HPC) and scalable data analytics computing. Currently, Doug serves as editor of the ClusterMonkey.net website and was previously editor of ClusterWorld Magazine and senior HPC Editor for Linux Magazine. He is also a writer and consultant to the scalable HPC/Analytics industry. His recent video tutorials and books include the Hadoop Fundamentals LiveLessons (Addison Wesley) video, Hadoop 2 Quick Start Guide (Addison Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (Co-author, Addison Wesley). Doug also designs high performance desk-side clusters for both HPC and data analytics.

In addition to the on-line trainings, Doug also teaches graduate level courses as part of two Masters in Data Science programs:

DS575 Big Data Techniques as part of an on-line Masters in Data Science from Juniata College.
DSCI 411 Data Management for Big Data as part of an in-person and online Masters in Data Science from Lehigh University.

These classes include more in-depth treatment and practical application of the scalable computing tools and examples covered in the one-day trainings. Consider enrolling in one of these excellent Data Science programs.

Contact: deadline(you know what goes here)eadline(and here)org
Mast: @thedeadline@mast.hpc.social
Twitter: @thedeadline

Class Notes for Bash Programming Quick-start

(Updated 19-Apr-2023)

First Steps for Bash Programming Training
Class Notes (tgz format)
Class Notes (zip format)

Class Notes for Hands-on Introduction to Apache Hadoop and Spark Programming

(Updated 13-Oct-2021)

Class Notes for Data Engineering at Scale with Apache Hadoop and Spark

(Updated 17-Dec-2020)

First Steps for Data Engineering Class
Class Notes (tgz format)
Class Notes (zip format)
Data Engineering at Scale Zeppelin Notebook (Right Click, Save Link As …)

Class Notes for Up and Running with Kubernetes

(Updated 07-Sep-2022)

Class Notes (tgz format)
Class Notes (zip format)

Class Notes for Getting Started with Kafka

(Updated 09-Aug-2022 - fixes typos)

First Steps for Getting Started with Kafka
Class Notes (tgz format)
Class Notes (zip format)
Additional note for running KafkaEsque on Apple M-based systems (Linux Virtual Machines running on UTM)

Class Notes for Kafka Methods and Administration

(Updated 20-Mar-2023 - some code and typo fixes)

First Steps for Kafka Methods and Administration
Class Notes (tgz format)
Class Notes (zip format)

Class Notes for Scalable PySpark for Data Science

(Updated 07-Jan-2024)

First Steps for Scalable PySpark for Data Science
Class Notes (tgz format)
Class Notes (zip format)
PySpark for Data Science Zeppelin Notebook (Right Click, Save Link As …)

Old Notes

Old Notes Files can be found here.

Zeppelin Notebook for Scalable Data Science with Hadoop and Spark

(Updated 15-Sep-2021)

Scalable-Analytics-V2.1.json New version that uses Hive, Python, and PySpark
Scalable-Analytics.json Old Version that uses Pig, Python, and PySpark

Supporting Documents (Cheat Sheets)

Linux Hadoop Minimal (LHM) Virtual Machine Sandbox

Used for Hands-on, Command Line, and Scalable Data Science trainings above. Note: This VM can also be used for the Hadoop and Spark Fundamentals: LiveLessons video mentioned below.

VERSION 2-8.1: (Current)

(Updated Jan-25-2024) CentOS Linux 7.6, Anaconda 3:Python 3.7.4, R 3.6.0, Hadoop 3.3.0, Hive 3.1.2, Apache Spark 2.4.5, Derby 10.14.2.0, Zeppelin 0.8.2, Sqoop 1.4.7, Kafka 2.5.0, HBase 2.4.10, NiFi 1.17.0, KafkaEsque. Used in all current trainings.

Linux Hadoop Minimal Installation Instructions VERSION 2 (Read First)

For VirtualBox X86 PC, Mac, Linux Machines

Linux Hadoop Minimal V2.0-8.1 MD5
Linux Hadoop Minimal Virtual Machine V2.0-8.1 OVA file US Europe (13.0G) NOTE: Chrome may prevent http downloads. Right click the link, choose “Save Link As”, then click “Keep” next to the blue discard box at the bottom of the browser.
Hadoop Minimal Build Notes x86 Virtual Box (tgz format)

For UTM Apple Mac M Machines

Linux Hadoop Minimal V2.0-M8.2.zip MD5
Linux Hadoop Minimal Virtual Machine V2.0-8.2 UTM file US Europe (8.0G) NOTE: Chrome may prevent http downloads. Right click the link, choose “Save Link As”, then click “Keep” next to the blue discard box at the bottom of the browser.
Hadoop Minimal Build Notes Mac UTM (tgz format)
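
Because the virtual machine images are large (8.0G to 13.0G), it can be worth checking a finished download against the MD5 value published alongside it before importing it into VirtualBox or UTM. The following is a minimal Python sketch of that check, not part of the official installation instructions. It assumes the MD5 file uses the common md5sum layout (hex digest first, then the file name). The function names are made up for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks so a
    multi-gigabyte image never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_line):
    """Compare a file's MD5 with the first field of an md5sum-style line.

    Assumption: the published line looks like '<hex digest>  <file name>'.
    """
    expected = published_line.split()[0].lower()
    return md5_of_file(path) == expected
```

To use it, read the first line of the downloaded MD5 file and pass it, together with the path to the downloaded image, to `matches_published_md5`. A result of `False` usually means the download was truncated or corrupted and should be repeated.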

Cloudera-Hortonworks HDP Sandbox

The Cloudera-Hortonworks HDP Sandbox is a full-featured Hadoop/Spark virtual machine that runs under Docker, VirtualBox, or VMware. Please see Cloudera/Hortonworks HDP Sandbox for more information. Due to the number of applications, the HDP Sandbox can require substantial resources to run.

Other Resources for all Classes

Contact

For further questions or help with the Linux Hadoop Minimal Virtual Machine please email: deadline(you know what goes here)eadline(and here)org
