Scalable PySpark for Data Science

The following steps explain how to load and start the Linux Hadoop Minimal Virtual Machine (LHM-VM) and download the course notes files. A fuller explanation is provided as part of the class; the steps below are a “quick start.”

If you are using Linux or Mac, the built-in terminal application already includes an “ssh client.”

If you are using Windows, you will need to install an “ssh client.” Either of the two listed below will work, and both are available at no cost. (MobaXterm is recommended.)

PuTTY - provides a terminal for ssh sessions.
MobaXterm - provides a terminal for ssh sessions and allows remote X Windows sessions.

See the Linux Hadoop Minimal Installation Instructions for how to start the Linux Hadoop Minimal Virtual Machine (LHM-VM).

When the VM is Started

Open a terminal (using Putty or MobaXterm on Windows) and enter the following to log in to the LHM-VM as user “hands-on” (password=“minimal”). Note: use MobaXterm if you want to use the Kafkaesque graphical tool.

  ssh hands-on@127.0.0.1 -p 2222
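
If the ssh command above reports "connection refused," the VM may still be booting. The following is a minimal sketch of a check for the forwarded port, run on the host machine. It assumes bash, since it relies on bash's /dev/tcp redirection, and the function name vm_port_open is hypothetical and not part of the class materials.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a TCP connection to host:port can be opened.
# Assumes bash, because /dev/tcp is a bash redirection feature.
vm_port_open() {
  local host="$1" port="$2"
  # Open the connection inside a subshell so the descriptor is closed right away.
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

if vm_port_open 127.0.0.1 2222; then
  echo "Port 2222 is open; try: ssh hands-on@127.0.0.1 -p 2222"
else
  echo "Port 2222 is not reachable; check that the LHM-VM has finished booting."
fi
```

An open port only shows that something is listening on 127.0.0.1:2222; the ssh login itself still confirms that the user name and password are accepted.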

Once you are logged in to the LHM-VM, you should see the following prompt string:

  [hands-on@localhost ~]$

The [hands-on@localhost ~] will not be shown in the rest of the class documentation. A $ will indicate the prompt string for input.

To download the Scalable PySpark for Data Science class notes into the LHM-VM, download and extract the course files (from inside the LHM-VM) as shown below:

  $ wget --no-check-certificate https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Scalable-PySpark-v1.tgz
  $ tar xvzf Scalable-PySpark-v1.tgz
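
Before relying on the extracted files, it can help to confirm that the download is a readable archive. The following is a minimal sketch, assuming it is run inside the LHM-VM from the directory that holds the download. The function name extract_notes is hypothetical and not part of the class materials.

```shell
# Hypothetical helper: verify that an archive is readable, then unpack it.
extract_notes() {
  local archive="$1"
  if [ ! -f "$archive" ]; then
    echo "missing archive: $archive" >&2
    return 1
  fi
  # "t" lists entries without writing files; a truncated or corrupt download fails here.
  if ! tar tzf "$archive" > /dev/null; then
    echo "unreadable archive: $archive" >&2
    return 1
  fi
  # Unpack into the current directory.
  tar xzf "$archive"
}

# Only act when the course archive is present in the current directory.
if [ -f Scalable-PySpark-v1.tgz ]; then
  extract_notes Scalable-PySpark-v1.tgz
fi
```

If the listing step fails, repeating the wget command to fetch a fresh copy is usually the simplest fix.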

These steps will be performed as part of the class.

Live On-Line Training: Scalable Data Pipelines with Hadoop, Spark, and Kafka
