Welcome to the Effective Data Pipelines Series

(previously Scalable Analytics with Apache Hadoop and Spark)

The six essential trainings on the path to scalable data science pipelines nirvana, or at least a good start

Click on a training name for availability and further information. New trainings are always being added. For best results, take the trainings in the recommended order shown below. Training 1 and Trainings 2&3 can be taken out of order. Training 4 builds on Trainings 1-3. Training 5 builds on and assumes competence with the topics in Trainings 1-4. Finally, Training 6 requires an understanding of the tools and topics in Trainings 1-5.

NOTE: If a link does not lead to the training, it has not yet been scheduled. Check back at a later date.

1 Apache Hadoop, Spark, and Kafka Foundations: Effective Data Pipelines - A great introduction to the Hadoop Big Data ecosystem with Spark and Kafka. A non-programming introduction to Hadoop, Spark, HDFS, MapReduce, and Kafka. After completing the workshop, attendees will have a workable understanding of the Hadoop/Spark/Kafka technical value proposition and a solid background for the remaining trainings in the Effective Data Pipelines Series. (3 hours-1 day)
2 Beginning Linux Command Line for Data Engineers and Analysts: Effective Data Pipelines - Quickly learn the essentials of using the Linux command line on Hadoop/Spark clusters: download/upload files, run applications, monitor resources, and navigate the Linux command line interface used on almost all modern analytics clusters. Students can download and run examples on the “Linux Hadoop Minimal” virtual machine, see below. (3 hours-1 day)
3 Intermediate Linux Command Line for Data Engineers and Analysts: Effective Data Pipelines - A continuation of Beginning Linux Command Line for Data Engineers and Analysts covering more advanced topics. Coverage includes: Linux Analytics, Moving Data into Hadoop HDFS (a brief sketch of moving data into HDFS follows this list), Running Command Line Analytics Tools, Bash Scripting Basics, and Creating Bash Scripts.
4 Hands-on Introduction to Apache Hadoop, Spark, and Kafka Programming - A hands-on introduction to using Hadoop, Hive, Sqoop, Spark, Kafka, and Zeppelin notebooks (a minimal word-count sketch follows this list). Students can download and run examples on the “Linux Hadoop Minimal” virtual machine, see below. (6 hours-2 days)
5 Data Engineering at Scale with Apache Hadoop and Spark - As part of the Effective Data Pipelines series, this training provides background and examples on data “munging,” or transforming raw data into a form that can be used with analytical modeling libraries. Also referred to as data wrangling, transformation, or ETL, these techniques are often performed “at scale” on a real cluster using Hadoop and Spark (an illustrative munging sketch follows this list). (3 hours-1 day)
6 Scalable Analytics with Apache Hadoop, Spark, and Kafka - A complete data science investigation requires different tools and strategies. In this training, learn how to apply Hadoop, Spark, and Kafka tools to predict airline delays (a rough modeling sketch follows this list). All programming will be done using Hadoop, Spark, and Kafka with the Zeppelin web notebook on a four-node cluster. The notebook will be made available for download so students can reproduce the examples. (3 hours-1 day)
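
To give a feel for the command-line material in Trainings 2 and 3, here is a minimal sketch of moving data into HDFS. It assumes a working Hadoop client, and the paths are invented for the example; in class the same hdfs dfs commands are run directly at the bash prompt, but they can also be scripted, for instance from Python:

  import subprocess

  # Invented example paths; adjust for your cluster
  local_file = "sales.csv"
  hdfs_dir = "/user/student/data"

  # Create the target HDFS directory (-p: no error if it already exists)
  subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)

  # Upload the local file (-f: overwrite if present), then list to confirm
  subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)
  subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)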
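
As a taste of the programming in Training 4, the following PySpark word count is a minimal sketch, not taken from the class notebooks; it assumes a local Spark installation and a text file named input.txt:

  from pyspark.sql import SparkSession

  # Start a Spark session (on a real cluster the master is set by YARN)
  spark = SparkSession.builder.appName("WordCount").getOrCreate()

  # Split each line into words and count the occurrences of each word
  lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])
  counts = (lines.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))

  for word, count in counts.collect():
      print(word, count)

  spark.stop()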
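
The data munging covered in Training 5 follows the general pattern sketched below; the file and column names are invented for illustration, and the training treats these DataFrame operations in much more depth:

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import col

  spark = SparkSession.builder.appName("Munging").getOrCreate()

  # Load raw CSV data with a header row (hypothetical file)
  raw = spark.read.csv("raw_data.csv", header=True)

  # Typical wrangling steps: drop incomplete rows, cast strings to
  # numbers, and filter out obviously bad records
  clean = (raw.dropna()
              .withColumn("price", col("price").cast("double"))
              .filter(col("price") > 0))

  # Write the transformed data back out in a columnar format
  clean.write.mode("overwrite").parquet("clean_data.parquet")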
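
Finally, here is a rough sketch of the kind of modeling step in Training 6. It is not the actual class notebook; the flights.parquet file and the dep_hour, distance, and delayed columns are assumptions for illustration:

  from pyspark.ml import Pipeline
  from pyspark.ml.classification import LogisticRegression
  from pyspark.ml.feature import VectorAssembler
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("DelayModel").getOrCreate()

  # Hypothetical cleaned flight data with a 0/1 "delayed" label column
  flights = spark.read.parquet("flights.parquet")

  # Assemble numeric features into a single vector column for Spark MLlib
  assembler = VectorAssembler(inputCols=["dep_hour", "distance"],
                              outputCol="features")
  lr = LogisticRegression(labelCol="delayed", featuresCol="features")

  # Fit the two-stage pipeline and report training accuracy
  model = Pipeline(stages=[assembler, lr]).fit(flights)
  print(model.stages[-1].summary.accuracy)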

About the Presenter

Douglas Eadline began his career as an analytical chemist with an interest in computer methods. Starting with the first Linux Beowulf HOWTO document, Doug has written instructional documents covering many aspects of Linux HPC, Hadoop, and analytics computing. Currently, Doug serves as editor of the ClusterMonkey.net website; he was previously editor of ClusterWorld Magazine and senior HPC editor for Linux Magazine. He is also a writer and consultant to the scalable HPC/analytics industry. His recent video tutorials and books include the Hadoop Fundamentals LiveLessons (Addison-Wesley) video, Hadoop 2 Quick-Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).

Update: I will be teaching DS575 Big Data Techniques as part of an online (remote) Master's in Data Science from Juniata College. The class will include more in-depth treatment and practical application of the scalable computing topics and examples I cover in these trainings. Consider enrolling in the Data Science program.


Class Notes for Beginning Linux Command Line for Data Engineers and Analysts

(Updated 09-Sep-2020)

Class Notes for Intermediate Linux Command Line for Data Engineers and Analysts

(Updated 04-Sep-2020)

Class Notes for Hands-on Introduction to Apache Hadoop and Spark Programming

(Updated 29-Sep-2020)

Class Notes for Data Engineering at Scale with Apache Hadoop and Spark

Class Notes for Up and Running with Kubernetes

(Updated 18-Jun-2020)

Class Notes for Implementing an Edge Computing Apache Kafka Inference Engine

(Updated 23-Jul-2020)

Old Notes

Zeppelin Notebook for Scalable Data Science with Hadoop and Spark

(Updated 09-Jul-2020)


Supporting Documents (Cheat Sheets)

Linux Hadoop Minimal (LHM) Virtual Machine Sandbox

Used for the Hands-on, Command Line, and Scalable Data Science trainings above. Note: This VM can also be used for the Hadoop and Spark Fundamentals: LiveLessons video mentioned below.

VERSION 2-beta4: (Current)

(Updated 17-Dec-2020) CentOS Linux 7.6, Anaconda 3 (Python 3.7.4), R 3.6.0, Hadoop 3.3.0, Hive 3.1.2, Apache Spark 2.4.5, Derby 10.14.2.0, Zeppelin 0.8.2, Sqoop 1.4.7, Kafka 2.5.0. Used in all current classes as of June 1, 2020.

VERSION 0.42: (Deprecated)

CentOS Linux 6.9, Apache Hadoop 2.8.1, Pig 0.17.0, Hive 2.3.2, Spark 1.6.3, Derby 10.13.1.1, Zeppelin 0.7.3, Sqoop 1.4.7, Flume 1.8.0. Used in previous classes.


Cloudera-Hortonworks HDP Sandbox

The Cloudera-Hortonworks HDP Sandbox is a full-featured Hadoop/Spark virtual machine that runs under Docker, VirtualBox, or VMware. Please see Cloudera/Hortonworks HDP Sandbox for more information. Due to the number of applications it includes, the HDP Sandbox can require substantial resources to run.


Other Resources for all Classes

Contact

For further questions or help with the Linux Hadoop Minimal Virtual Machine please email: deadline(you know what goes here)limulus-computing(and here)com

Unless otherwise noted, all training content, notes, and examples are © Copyright Basement Supercomputing 2019, 2020. All rights reserved.
