=====Welcome to the Effective Data Pipelines Series=====
(previously Scalable Analytics with Apache Hadoop and Spark)

**The six essential trainings on the path to scalable data science pipelines nirvana--or at least a good start**
  
====Training Descriptions and Links====
  
Click on the training name for availability and further information. New trainings are always being added. For best results, the trainings should be taken in the recommended order (shown below). Training 1 and Trainings 2&3 can be taken out of order. Training 4 builds on Trainings 1-3. Training 5 builds on and assumes competence with the topics in Trainings 1-4. Finally, Training 6 requires an understanding of the tools and topics in Trainings 1-5.
  
**NOTE: If a link does not lead you to the training, it has not yet been scheduled. Check back at a later date.**

| 1 | [[https://www.oreilly.com/search/?query=Apache%20Hadoop%2C%20Spark%2C%20and%20Kafka%20Foundations%3A%20Effective%20Data%20Pipelines&extended_publisher_data=true&field=title|Apache Hadoop, Spark, and Kafka Foundations: Effective Data Pipelines]] - A great introduction to the Hadoop Big Data Ecosystem with Spark and Kafka. A non-programming introduction to Hadoop, Spark, HDFS, MapReduce, and Kafka. After completing the workshop, attendees will have a workable understanding of the Hadoop/Spark/Kafka technical value proposition and a solid background for the remaining trainings in the Effective Data Pipelines Series. (3 hours-1 day)|{{:wiki:oreilly-logo-foundations-dp.png?400}}|
| 2 | [[https://www.oreilly.com/search/?query=Beginning%20Linux%20Command%20Line%20for%20Data%20Engineers%20and%20Analysts%3A%20Effective%20Data%20Pipelines&extended_publisher_data=true&field=title|Beginning Linux Command Line for Data Engineers and Analysts: Effective Data Pipelines]] - Quickly learn the essentials of using the Linux command line on Hadoop/Spark clusters. Download/upload files, run applications, monitor resources, and navigate the Linux command line interface used on almost all modern analytics clusters. Students can download and run examples on the "Linux Hadoop Minimal" virtual machine, see below. (3 hours-1 day)|{{:wiki:oreilly-begin-command-line-dp-logo.png?400}}|
| 3 | [[https://www.oreilly.com/search/?query=Intermediate%20Linux%20Command%20Line%20for%20Data%20Engineers%20and%20Analysts%3A%20Effective%20Data%20Pipelines&extended_publisher_data=true&field=title|Intermediate Linux Command Line for Data Engineers and Analysts: Effective Data Pipelines]] - A continuation of Beginning Linux Command Line for Data Engineers and Analysts, covering more advanced topics. Coverage includes: Linux analytics, moving data into Hadoop HDFS, running command line analytics tools, Bash scripting basics, and creating Bash scripts.|{{:wiki:oreilly-inter-command-line-dp-logo.png?400}}|
| 4 | [[https://www.safaribooksonline.com/search/?query=Hands-on%20Introduction%20to%20Apache%20Hadoop%20Spark%20and%20Kafka%20Programming&field=title|Hands-on Introduction to Apache Hadoop, Spark, and Kafka Programming]] - A hands-on introduction to using Hadoop, Hive, Sqoop, Spark, Kafka, and Zeppelin notebooks. Students can download and run examples on the "Linux Hadoop Minimal" virtual machine, see below. (6 hours-2 days)|{{:wiki:oreilly-hands-on-logo.png?400|}}|
| 5 | [[https://www.oreilly.com/search/?query=Data%20Engineering%20at%20Scale%20with%20Apache%20Hadoop%20and%20Spark%3A%20Effective%20Data%20Pipelines&extended_publisher_data=true&field=title|Data Engineering at Scale with Apache Hadoop and Spark]] - As part of the Effective Data Pipelines series, this training provides background and examples on data "munging," or transforming raw data into a form that can be used with analytical modeling libraries. Also referred to as data wrangling, transformation, or ETL, these techniques are often performed "at scale" on a real cluster using Hadoop and Spark. (3 hours-1 day)|{{:wiki:oreilly-data-eng-at-scale-dp-logo.png?400|}}|
| 6 | [[https://www.oreilly.com/search/?query=Scalable%20Analytics%20with%20Apache%20Hadoop%2C%20Spark%2C%20and%20Kafka%3A%20Effective%20Data%20Pipelines&extended_publisher_data=true&field=title|Scalable Analytics with Apache Hadoop, Spark, and Kafka]] - A complete data science investigation requires different tools and strategies. In this training, learn how to apply Hadoop, Spark, and Kafka tools to predict airline delays. All programming will be done using Hadoop, Spark, and Kafka with the Zeppelin web notebook on a four-node cluster. The notebook will be made available for download so students can reproduce the examples. (3 hours-1 day)|{{wiki: scalable-DS-course.png}}|

=== About the Presenter ===
**Douglas Eadline** began his career as an analytical chemist with an interest in computer methods. Starting with the first Linux Beowulf How-To document, Doug has written instructional documents covering many aspects of Linux HPC, Hadoop, and analytics computing. Currently, Doug serves as editor of the //ClusterMonkey.net// website; he was previously editor of //ClusterWorld Magazine// and senior HPC editor for //Linux Magazine//. He is also a writer and consultant to the scalable HPC/analytics industry. His recent video tutorials and books include the Hadoop Fundamentals LiveLessons video (Addison-Wesley), Hadoop 2 Quick-Start Guide (Addison-Wesley), High Performance Computing for Dummies (Wiley), and Practical Data Science with Hadoop and Spark (co-author, Addison-Wesley).

**Update:** I will be assisting with a new training entitled "Big Data Techniques" as part of an on-line (remote) [[https://www.juniata.edu/academics/graduate-programs/data-science.php|Masters in Data Science]] from Juniata College. The class will include a more in-depth treatment of the topics and examples I cover in these trainings.
  
----

====Class Notes for Beginning Linux Command Line for Data Engineers and Analysts====
(Updated 22-Jan-2020)
  * [[First Steps for Command Line Class]]
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Begin-Linux-Command-Line-V1.0.tgz|Class Notes]] (tgz format)
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Begin-Linux-Command-Line-V1.0.zip|Class Notes]] (zip format)

====Class Notes for Intermediate Linux Command Line for Data Engineers and Analysts====
(Updated 29-Jan-2020)
  * [[First Steps for Command Line Class]]
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Inter-Linux-Command-Line-V1.0.tgz|Class Notes]] (tgz format)
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Inter-Linux-Command-Line-V1.0.zip|Class Notes]] (zip format)
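The class notes above are distributed as tgz and zip archives. A minimal sketch of unpacking one from the Linux command line (the demo below builds a stand-in archive so the commands run anywhere; substitute the actual filename from the links above, e.g. ''Inter-Linux-Command-Line-V1.0.tgz''):

```shell
# Real usage after downloading:  tar xvzf Inter-Linux-Command-Line-V1.0.tgz
# (zip alternative:              unzip Inter-Linux-Command-Line-V1.0.zip)
# Stand-in demo so the commands can be run anywhere:
mkdir -p notes && echo "sample" > notes/README.txt
tar czf notes.tgz notes          # create a stand-in archive
rm -r notes
tar xvzf notes.tgz               # extract, as you would the class notes
cat notes/README.txt             # prints "sample"
```

The extracted directory contains the notes and example files referenced during class.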
  
====Class Notes for Hands-on Introduction to Apache Hadoop and Spark Programming====
(Updated 13-Feb-2020)

  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Hands_On_Hadoop_Spark-V1.5.1.tgz|Class Notes]] (tgz format)
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Hands_On_Hadoop_Spark-V1.5.1.zip|Class Notes]] (zip format)
  
====Class Notes for Data Engineering at Scale with Apache Hadoop and Spark====
(Updated 23-Jun-2020)

  * [[First Steps for Data Engineering Class]]
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Data-Engineering-at-Scale-V1.0.tgz|Class Notes]] (tgz format)
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Data-Engineering-at-Scale-V1.0.zip|Class Notes]] (zip format)
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Scalable-Data-Engineering.json|Data Engineering at Scale Zeppelin Notebook]]

====Class Notes for Up and Running with Kubernetes====
(Updated 18-Jun-2020)
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Up_Running_Kubernetes-V1.0.tgz|Class Notes]] (tgz format)
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Up_Running_Kubernetes-V1.0.zip|Class Notes]] (zip format)

====Old Notes====

[[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Old-Notes|Old Notes files can be found here.]]

====Zeppelin Notebook for Scalable Data Science with Hadoop and Spark====
(Updated 09-Jul-2020)

  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Scalable-Analytics-V2.json|Scalable-Analytics-V2.json]] New version that uses Hive, Python, and PySpark
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Scalable-Analytics.json|Scalable-Analytics.json]] Old version that uses Pig, Python, and PySpark
  
----
  
====Supporting Documents (Cheat Sheets)====
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Supporting-Docs/DOS-Linux-HDFS-cheatsheet.pdf|DOS to Linux/HDFS Cheat Sheet]]
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Supporting-Docs/ericg_vi-editor.bw.pdf|vi (visual editor) Cheat Sheet]]
  * [[https://www.cs.colostate.edu/helpdocs/vi.html|Additional help with vi]]
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Supporting-Docs/Hortonworks.CheatSheet.SQLtoHive.pdf|Hortonworks SQL to Hive Cheat Sheet]]
  
----
====Linux Hadoop Minimal (LHM) Virtual Machine Sandbox====
  
Used for the //Hands-on//, //Command Line//, and //Scalable Data Science// trainings above. Note: this VM can also be used with the //Hadoop and Spark Fundamentals: LiveLessons// video mentioned below.

===VERSION 2: (Current)===
(Updated 24-Jun-2020)
CentOS Linux 7.6, Anaconda 3: Python 3.7.4, R 3.6.0, Hadoop 3.2.1, Hive 3.1.2, Apache Spark 2.4.5, Derby 10.14.2.0, Zeppelin 0.8.2, Sqoop 1.4.7, Kafka 2.5.0. **Used in all current classes as of June 1, 2020.**

  * [[Linux Hadoop Minimal Installation Instructions VERSION 2]] (Read First)
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-V2.0-beta2.MD5.txt|Linux Hadoop Minimal V2.0-beta2 MD5]]
  * Linux Hadoop Minimal Virtual Machine V2.0-beta2 OVA file: [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-V2.0-beta2.ova|US]] [[http://134.209.239.225/download/Linux-Hadoop-Minimal-V2.0-beta2.ova|Europe]] (5.0G)

===VERSION 0.42: (Deprecated)===
CentOS Linux 6.9, Apache Hadoop 2.8.1, Pig 0.17.0, Hive 2.3.2, Spark 1.6.3, Derby 10.13.1.1, Zeppelin 0.7.3, Sqoop 1.4.7, Flume 1.8.0. **Used in previous classes.**

  * [[Linux Hadoop Minimal Installation Instructions VERSION 0.42]] (Read First)
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-0.42.MD5.txt|Linux Hadoop Minimal V0.42 MD5]]
  * Linux Hadoop Minimal Virtual Machine V0.42 OVA file: [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-0.42.ova|US]] [[http://134.209.239.225/download/Linux-Hadoop-Minimal-0.42.ova|Europe]] (3.3G)
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/old|Old Versions]]
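An MD5 file is published alongside each OVA image above. A minimal sketch of verifying a download before importing it (the demo uses a stand-in file so the commands run anywhere; the real filenames are ''Linux-Hadoop-Minimal-V2.0-beta2.ova'' and its ''.MD5.txt'' from the links above, and ''md5sum -c'' assumes the MD5 file uses the standard ''<hash>  <filename>'' layout):

```shell
# Verify a downloaded OVA against its published MD5 checksum.
# Real usage:  md5sum -c Linux-Hadoop-Minimal-V2.0-beta2.MD5.txt
# Stand-in demo so the commands can be run anywhere:
echo "ova bytes" > sample.ova
md5sum sample.ova > sample.MD5.txt   # stand-in for the published MD5 file
md5sum -c sample.MD5.txt             # prints "sample.ova: OK" on success
```

Once the checksum matches, the OVA can be imported into VirtualBox (File -> Import Appliance, or ''VBoxManage import'' from the command line) as described in the installation instructions.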
  
  
----
/*
====Zeppelin Web Notebook====
For those taking the //Scalable Data Science// training, a 30-day web-based Zeppelin Notebook is available from [[https://www.basement-supercomputing.com|Basement Supercomputing]]. Please use the [[Sign Up Form]] to get access to the notebook.

----
*/
====Other Resources for all Classes====
  * Book: [[https://www.clustermonkey.net/Hadoop2-Quick-Start-Guide/|Hadoop® 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop® 2 Ecosystem]]
----
  
**Unless otherwise noted, all training content, notes, and examples (c) Copyright Basement Supercomputing 2019-2020. All rights reserved.**
  
  
start.txt · Last modified: 2024/01/29 21:19 by deadline