Welcome to the Effective Data Pipelines Series

This page provides many of the resources for books, videos and on-line trainings.

You can find more information on all current video and book titles and upcoming on-line trainings from O'Reilly.

Training Descriptions

Many of the trainings are run on a regular basis. Check the O'Reilly site for upcoming live events. Where possible, trainings are demonstrated using the freely available virtual machine. To facilitate continued exploration using the virtual machine, training notes (text files) are available below.


About the Presenter

Douglas Eadline began his career as an analytical chemist with an interest in computer methods. Starting with the first Linux Cluster Beowulf How-to document, Doug has written instructional documents covering many aspects of Linux High Performance Computing (HPC) and scalable data analytics computing. Currently, Doug serves as editor of the ClusterMonkey.net website; he was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also a writer and consultant to the scalable HPC/Analytics industry. His recent video tutorials and books include the Hadoop Fundamentals LiveLessons video (Addison Wesley), Hadoop 2 Quick Start Guide (Addison Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison Wesley). Doug also designs high performance desk-side clusters for both HPC and data analytics.

In addition to the on-line trainings, Doug also teaches graduate-level courses as part of two Master's in Data Science programs:

These classes include more in-depth treatment and practical application of the scalable computing tools and examples covered in the one-day trainings. Consider enrolling in one of these excellent Data Science programs.

Contact: deadline(you know what goes here)eadline(and here)org
Mast: @thedeadline@mast.hpc.social
Twitter: @thedeadline


Class Notes for Bash Programming Quick-start

(Updated 19-Apr-2023)

Class Notes for Hands-on Introduction to Apache Hadoop and Spark Programming

(Updated 13-Oct-2021)

Class Notes for Data Engineering at Scale with Apache Hadoop and Spark

(Updated 17-Dec-2020)

Class Notes for Up and Running with Kubernetes

(Updated 07-Sep-2022)

Class Notes for Getting Started with Kafka

(Updated 09-Aug-2022 - fixes typos)

Class Notes for Kafka Methods and Administration

(Updated 20-Mar-2023 - some code and typo fixes)

Class Notes for Scalable PySpark for Data Science

(Updated 07-Jan-2024)

Old Notes

Old Notes Files can be found here.

Zeppelin Notebook for Scalable Data Science with Hadoop and Spark

(Updated 15-Sep-2021)


Supporting Documents (Cheat Sheets)


Linux Hadoop Minimal (LHM) Virtual Machine Sandbox

Used for Hands-on, Command Line, and Scalable Data Science trainings above. Note: This VM can also be used for the Hadoop and Spark Fundamentals: LiveLessons video mentioned below.

VERSION 2-8.1: (Current)

(Updated 25-Jan-2024) CentOS Linux 7.6, Anaconda 3: Python 3.7.4, R 3.6.0, Hadoop 3.3.0, Hive 3.1.2, Apache Spark 2.4.5, Derby 10.14.2.0, Zeppelin 0.8.2, Sqoop 1.4.7, Kafka 2.5.0, HBase 2.4.10, NiFi 1.17.0, KafkaEsque. Used in all current trainings.

Linux Hadoop Minimal Installation Instructions VERSION 2 (Read First)

For VirtualBox X86 PC, Mac, Linux Machines
For UTM Apple Mac M Machines

Cloudera-Hortonworks HDP Sandbox

The Cloudera-Hortonworks HDP Sandbox is a full-featured Hadoop/Spark virtual machine that runs under Docker, VirtualBox, or VMware. Please see Cloudera/Hortonworks HDP Sandbox for more information. Because of the number of applications it includes, the HDP Sandbox can require substantial resources to run.


Other Resources for all Classes


Contact

For further questions or help with the Linux Hadoop Minimal Virtual Machine, please email: deadline(you know what goes here)eadline(and here)org


Unless otherwise noted, all training content, notes, and examples © Douglas Eadline 2019-2024. All rights reserved.