start [2024/01/07 23:17] deadline [Class Notes for Scalable PySpark for Data Science] added notebook |
start [2026/02/09 21:32] (current) deadline [About the Presenter] |
| |
| Contact: ''deadline''(you know what goes here)''eadline''(and here)''org''\\ | Contact: ''deadline''(you know what goes here)''eadline''(and here)''org''\\ |
| Mast: @thedeadline@mast.hpc.social \\ | |
| Twitter: @thedeadline | * Mast: @thedeadline@mast.hpc.social \\ |
| | * Twitter: @thedeadline |
| | * BlueSky: @thedeadline.bsky.social |
| |
| ---- | ---- |
| * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Getting-Started-Kafka-V2.1.tgz|Class Notes]] (tgz format) | * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Getting-Started-Kafka-V2.1.tgz|Class Notes]] (tgz format) |
| * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Getting-Started-Kafka-V2.1.zip|Class Notes]] (zip format) | * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Getting-Started-Kafka-V2.1.zip|Class Notes]] (zip format) |
| | * Additional [[https://www.clustermonkey.net/download/Eadline/Lehigh/Week-01/Install-KafkaEsque-Local-Mac-M.pdf|note]] for running KafkaEsque on Apple M-based systems (Linux virtual machines running on UTM) |
| |
| ====Class Notes for Kafka Methods and Administration ==== | ====Class Notes for Kafka Methods and Administration ==== |
| * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Scalable-PySpark-v1.tgz|Class Notes]] (tgz format) | * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Scalable-PySpark-v1.tgz|Class Notes]] (tgz format) |
| * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Scalable-PySpark-v1.zip|Class Notes]] (zip format) | * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Scalable-PySpark-v1.zip|Class Notes]] (zip format) |
| * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Scalable_PySpark_with_CSV_Files_and_Hive_Tables.json| PySpark for Data Science Zeppelin Notebook]] (Right Click, Save Link As ...) | * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Zeppelin-Notebooks/Scalable_PySpark_with_CSV_Files_and_Hive_Tables.json| PySpark for Data Science Zeppelin Notebook]] (Right Click, Save Link As ...) |
| |
| === Old Notes ==== | === Old Notes ==== |
| Used for //Hands-on//, //Command Line//, and //Scalable Data Science// trainings above. Note: This VM can also be used for the //Hadoop and Spark Fundamentals: LiveLessons// video mentioned below. | Used for //Hands-on//, //Command Line//, and //Scalable Data Science// trainings above. Note: This VM can also be used for the //Hadoop and Spark Fundamentals: LiveLessons// video mentioned below. |
| |
| ===VERSION 2-beta8: (Current)=== | ===VERSION 3.0-beta-2: (Current)=== |
| === IMPORTANT: VirtualBox will not work on the new Apple M1 based systems ==== | (Updated Jan-25-2024) |
| |
| (Updated Aug-08-2022) | [[Linux Hadoop Minimal Installation Instructions VERSION 3]] |
| CentOS Linux 7.6, Anaconda 3:Python 3.7.4, R 3.6.0, Hadoop 3.3.0, Hive 3.1.2, Apache Spark 2.4.5, Derby 10.14.2.0, Zeppelin 0.8.2, Sqoop 1.4.7, Kafka 2.5.0, HBase 2.4.10, NiFi 1.17.0, KafkaEsque. **Used in all current trainings.** | |
| | Contents: Rocky Linux 9.7: Python 3.9.25, R 4.5.2, Hadoop 3.3.6, Hive 4.0.1, Apache Spark 3.5.6, Derby 10.14.2.0, Zeppelin 0.11.2, Sqoop 1.4.7, Kafka 3.4.1, HBase 2.6.2, NiFi 1.17.0, KafkaEsque. **Used in all classes, trainings, and workshops after January 1, 2026.** |
| | |
| | ===VERSION 2-8.1: (Previous, no longer supported)=== |
| | (Updated Jan-25-2024) |
| | |
| | [[Linux Hadoop Minimal Installation Instructions VERSION 2]] |
| | |
| | Contents: CentOS Linux 7.6, Anaconda 3:Python 3.7.4, R 3.6.0, Hadoop 3.3.0, Hive 3.1.2, Apache Spark 2.4.5, Derby 10.14.2.0, Zeppelin 0.8.2, Sqoop 1.4.7, Kafka 2.5.0, HBase 2.4.10, NiFi 1.17.0, KafkaEsque. **Used in all classes, trainings, and workshops prior to January 1, 2026.** |
| |
| * [[Linux Hadoop Minimal Installation Instructions VERSION 2]] (Read First) | |
| * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-V2.0-beta8.MD5.txt|Linux Hadoop Minimal V2.0-beta8 MD5]] | |
| * Linux Hadoop Minimal Virtual Machine V2.0-beta8 OVA file [[http://161.35.229.207/download/Linux-Hadoop-Minimal-V2.0-beta8.ova|US]] [[http://134.209.239.225/download/Linux-Hadoop-Minimal-V2.0-beta8.ova|Europe]] (11.0G) **NOTE:** Chrome may block //http// downloads. Right-click the link, choose "Save Link As", then click "Keep" next to the blue discard box at the bottom of the browser. | |
| * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Hadoop-Minimal-Install-Notes-V2-beta8.tgz|Hadoop Minimal Build Notes (tgz format)]] | |
| |
| ---- | |
| |
| ====Cloudera-Hortonworks HDP Sandbox==== | |
| |
| The Cloudera-Hortonworks HDP Sandbox is a full-featured Hadoop/Spark virtual machine that runs under Docker, VirtualBox, or VMWare. Please see [[https://www.cloudera.com/downloads/hortonworks-sandbox.html|Cloudera/Hortonworks HDP Sandbox]] for more information. Because of the number of applications it includes, the HDP Sandbox can require substantial resources to run. | |
| |
| ---- | |
| /* | |
| ====Zeppelin Web Notebook==== | |
| For those taking the //Scalable Data Science// training, a 30-day web-based Zeppelin Notebook is available from [[https://www.basement-supercomputing.com|Basement Supercomputing]]. Please use the [[Sign Up Form]] to get access to the notebook. | |
| |
| ---- | ---- |
| */ | |
| ====Other Resources for all Classes==== | ====Other Resources for all Classes==== |
| * Book: [[https://www.clustermonkey.net/Hadoop2-Quick-Start-Guide/| Hadoop® 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop® 2 Ecosystem]] | * Book: [[https://www.clustermonkey.net/Hadoop2-Quick-Start-Guide/| Hadoop® 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop® 2 Ecosystem]] |
| * Book: [[https://www.clustermonkey.net/Practical-Data-Science-with-Hadoop-and-Spark|Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale]] | * Book: [[https://www.clustermonkey.net/Practical-Data-Science-with-Hadoop-and-Spark|Practical Data Science with Hadoop® and Spark: Designing and Building Effective Analytics at Scale]] |
| * Video Tutorial: [[https://www.oreilly.com/videos/data-engineering-foundations/9780137440580|Data Engineering Foundations Part 1: LiveLessons: Using Spark, Hive, and Hadoop® Tools]] | * Video Tutorial: [[https://www.oreilly.com/videos/data-engineering-foundations/9780137440580|Data Engineering Foundations Part 1: LiveLessons: Using Spark, Hive, and Hadoop® Tools]] |
| * Video Tutorial (**NEW**): [[https://www.informit.com/store/data-engineering-foundations-part-2-building-data-pipelines-9780138086992|Data Engineering Foundations Part 2: Building Data Pipelines with Kafka and Nifi ]] | * Video Tutorial: [[https://www.informit.com/store/data-engineering-foundations-part-2-building-data-pipelines-9780138086992|Data Engineering Foundations Part 2: Building Data Pipelines with Kafka and Nifi ]] |
| | * Video Tutorial (**NEW**): [[https://www.oreilly.com/library/view/kafka-essentials-livelessons/9780138176761/|Kafka Essentials LiveLessons: A Quick-Start for Building Effective Data Pipelines ]] |
| |
| ---- | ---- |
| ---- | ---- |
| |
| **Unless otherwise noted, all training content, notes, and examples (c) Douglas Eadline 2019-2023 All rights reserved.** | **Unless otherwise noted, all training content, notes, and examples (c) Douglas Eadline 2019-2024 All rights reserved.** |
| |
| |