This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
linux_hadoop_minimal_installation_instructions [2019/06/11 14:00] deadline created |
linux_hadoop_minimal_installation_instructions [2020/05/21 18:46] (current) deadline |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | =====Linux Hadoop Minimal Notes===== | + | =====Linux Hadoop Minimal |
- | Version .42 | + | **Version:** .42\\ |
- | Date: June 3, 2019 | + | **Date:** June 3, 2019\\ |
- | Author: Douglas Eadline | + | **Author:** Douglas Eadline\\ |
+ | **Email:** deadline(you know what goes here)basement-supercomputing.com | ||
- | Unless otherwise noted, all course content, notes, and examples are | + | **Unless otherwise noted, all course content, notes, and examples are |
- | Copyright Basement Supercomputing 2019, All rights reserved. | + | (c) Copyright Basement Supercomputing 2019, All rights reserved.** |
- | + | ||
- | ===What Is This?=== | + | |
+ | ====What Is This?==== | ||
The Linux Hadoop Minimal is a virtual machine (VM) that can be used to | The Linux Hadoop Minimal is a virtual machine (VM) that can be used to | ||
- | try the examples presented in the two on-line | + | try the examples presented in the following |
- | | + | |
+ | * [[https:// | ||
- | " | + | It can also be used for the [[https:// |
+ | video tutorial (14+ hours): | ||
+ | * [[https:// | ||
- | It can also be used for the examples provided in the companion on-line | + | The machine has many important Hadoop and Spark packages installed and at the same time tries to keep the resource usage as low as possible so the VM can be used on most laptops. (See below for resource recommendations.) |
- | video tutorial (14+ hours) | + | |
- | + | ||
- | " | + | |
- | + | ||
- | The machine has many important Hadoop and Spark packages installed and | + | |
- | at the same time tries to keep the resource usage as low as possible | + | |
- | so the VM can used on most laptops. (See below for resource recommendations) | + | |
To learn more about the course and my other analytics books and videos, go to: | To learn more about the course and my other analytics books and videos, go to: | ||
- | https:// | + | |
PLEASE NOTE: This version of Linux Hadoop Minimal (LHM) is still considered | PLEASE NOTE: This version of Linux Hadoop Minimal (LHM) is still considered | ||
" | " | ||
- | deadline@eadline.org with " | + | deadline(you know what goes here)basement-supercomputing.com with " |
+ | |||
+ | ====Student Usage==== | ||
+ | If you have taken the " | ||
+ | |||
+ | For instance, to download and extract the archive for the " | ||
+ | |||
+ | wget https:// | ||
+ | tar xvzf Hands_On_Hadoop_Spark-V1.5.1.tgz | ||
+ | |||
+ | Similarly, for the "Linux Command Line" course (do this within the VM) | ||
+ | |||
+ | wget https:// | ||
+ | tar xvzf Linux-Command-Line-V1.0.tgz | ||
+ | |||
+ | If you want to move files from your local machine to the VM, then you can use '' | ||
+ | on your host. ('' | ||
+ | MobaXterm package on Windows) | ||
+ | |||
+ | scp -P2222 | ||
+ | |||
+ | '' | ||
+ | be used for most of the examples. Therefore, the command to copy a file ('' | ||
+ | host system to the VM is (it places the file in ''/ | ||
+ | |||
+ | scp -P2222 | ||
+ | |||
+ | See the [[#Connect From Your Local Machine to the LHM Sandbox|Connect From Your Local Machine to the LHM Sandbox]] section below for more information | ||
+ | on using '' | ||
+ | |||
+ | ====General Usage Notes==== | ||
+ | |||
+ | 1. The Linux Hadoop Minimal includes the following Apache software. Note: Spark 1.6.3 is installed because later versions need Python 2.7+ (not available in CentOS)\\ | ||
+ | < | ||
+ | CentOS Linux 6.9 minimal | ||
+ | Apache Hadoop 2.8.1 | ||
+ | Apache Pig 0.17.0 | ||
+ | Apache Hive 2.3.2 | ||
+ | Apache Spark 1.6.3 | ||
+ | Apache Derby 10.13.1.1 | ||
+ | Apache Zeppelin 0.7.3 | ||
+ | Apache Sqoop-1.4.7 | ||
+ | Apache Flume-1.8.0 | ||
+ | </ | ||
+ | |||
+ | 2. The Linux Hadoop Minimal has been tested with VirtualBox on Linux, MacOS 10.12, and Windows 10 Home edition. It has not been tested with VMware. | ||
+ | |||
+ | 3. The Linux Hadoop Minimal Virtual Machine is designed to work on minimal hardware. It is recommended that, at a MINIMUM, your system have 2 cores, 4 GB of memory, and 70 GB of disk space. The VM is set to use 2.5 GB of memory. This will cause some applications to swap to disk, but it should allow the virtual machine to run on a 4GB laptop/ | ||
+ | |||
+ | 4. The above packages have not been fully tested although all of the examples from the course should work. | ||
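The minimum in item 3 (2 cores, 4 GB of memory, 70 GB of disk) can be checked on a Linux host before installing. The following is a minimal sketch, not part of the LHM distribution: the function names are hypothetical, memory is rounded up to whole GB because a nominal 4 GB machine usually reports slightly less, and the disk figure is the free space in the current directory.

```shell
# Sketch: compare host resources against the LHM minimum from item 3.
# lhm_check_host and lhm_check_this_host are hypothetical helper names.

lhm_check_host() {
    # Usage: lhm_check_host CORES MEM_GB DISK_GB
    # Prints one line per shortfall; returns 0 if all minimums are met.
    local short=0
    [ "$1" -ge 2 ]  || { echo "cores: $1 (need 2)"; short=1; }
    [ "$2" -ge 4 ]  || { echo "memory: $2 GB (need 4)"; short=1; }
    [ "$3" -ge 70 ] || { echo "free disk: $3 GB (need 70)"; short=1; }
    return "$short"
}

lhm_check_this_host() {
    # Gathers the values on a Linux host (nproc, /proc/meminfo, df).
    local cores mem_gb disk_gb
    cores=$(nproc)
    # MemTotal is in kB; round up to whole GB.
    mem_gb=$(awk '/^MemTotal:/ {printf "%d", ($2 + 1048575) / 1048576}' /proc/meminfo)
    # Column 4 of "df -Pk" is the available space in kB.
    disk_gb=$(df -Pk . | awk 'NR==2 {printf "%d", $4 / 1048576}')
    lhm_check_host "$cores" "$mem_gb" "$disk_gb"
}
```

Run `lhm_check_this_host` in the directory where the VM image will be stored. No output means the host meets the stated minimum.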
+ | |||
+ | ====Installation Steps==== | ||
+ | |||
+ | **Step 1:** Download and install VirtualBox for your environment. VirtualBox is freely available. Note: Some Windows environments may need the Extension Pack. See the [[https:// | ||
+ | |||
+ | **Step 2:** Follow the installation instructions for your Operating System environment. For Red Hat based systems this page, https:// | ||
+ | If you are using Windows, you will need an "ssh client." | ||
+ | |||
+ | * [[http:// | ||
+ | * [[http:// | ||
+ | |||
+ | **Step 3:** Make sure hardware virtualization is enabled in your BIOS. | ||
+ | |||
+ | **Step 4:** Download the https:// | ||
+ | |||
+ | |||
+ | **Step 5:** Start the VM. All the essential Hadoop services should be started automatically. | ||
+ | |||
+ | |||
+ | ====Connect From Your Local Machine to the LHM Sandbox==== | ||
+ | |||
+ | It is possible to log in and use the sandbox from the VirtualBox terminal; however, you will have much | ||
+ | more flexibility with local terminals. Follow the instructions below for local terminal access. | ||
+ | |||
+ | As a test, open a text terminal and connect to the sandbox as the root user with '' | ||
+ | Linux machines have '' | ||
+ | |||
+ | * [[https:// | ||
+ | |||
+ | The root password is: **hadoop** | ||
+ | |||
+ | ssh root@127.0.0.1 -p 2222 | ||
+ | |||
+ | You should now be in the ''/ | ||
+ | |||
+ | To confirm that all the Hadoop daemons have started, enter '' | ||
+ | |||
+ | < | ||
+ | # jps | ||
+ | 1938 NetworkServerControl | ||
+ | 2036 ZeppelinServer | ||
+ | 1797 ResourceManager | ||
+ | 1510 NameNode | ||
+ | 1973 RunJar | ||
+ | 1576 SecondaryNameNode | ||
+ | 1882 JobHistoryServer | ||
+ | 1675 DataNode | ||
+ | 1962 RunJar | ||
+ | 1841 NodeManager | ||
+ | 2445 Jps | ||
+ | </ | ||
+ | |||
+ | ====Copying Files In and Out of the Virtual Machine==== | ||
+ | |||
+ | To copy a file from your LOCAL MACHINE into the VM, use the '' | ||
+ | |||
+ | scp -P2222 | ||
+ | |||
+ | To be clear, the above command is run on your '' | ||
+ | |||
+ | To copy a file from the VM to your '' | ||
+ | |||
+ | scp -P2222 hands-on@127.0.0.1:/ | ||
+ | |||
+ | To be clear, the above command is run on your '' | ||
+ | |||
+ | On Windows, the data will be placed in the MobaXterm " | ||
+ | |||
+ | C: | ||
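The two copy directions above can be wrapped in small helper functions so the port and account do not have to be retyped. This is a sketch only: the function names and the `LHM_DRY_RUN` variable are hypothetical, while the user (`hands-on`), address (`127.0.0.1`), and port (`2222`) are the ones used throughout this page. The remote path is left as an argument.

```shell
# Sketch: scp wrappers for the LHM VM, run on the host (local) machine.
# lhm_put, lhm_get and LHM_DRY_RUN are hypothetical names.
LHM_USER=hands-on
LHM_HOST=127.0.0.1
LHM_PORT=2222

lhm_put() {
    # Usage: lhm_put LOCAL_FILE REMOTE_PATH   (host -> VM)
    # With LHM_DRY_RUN set, the command is printed instead of run.
    ${LHM_DRY_RUN:+echo} scp -P "$LHM_PORT" "$1" "$LHM_USER@$LHM_HOST:$2"
}

lhm_get() {
    # Usage: lhm_get REMOTE_PATH LOCAL_PATH   (VM -> host)
    ${LHM_DRY_RUN:+echo} scp -P "$LHM_PORT" "$LHM_USER@$LHM_HOST:$1" "$2"
}
```

Setting `LHM_DRY_RUN=1` before calling either function shows the exact `scp` command without contacting the VM, which is a convenient way to check the paths first.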
+ | |||
+ | ====Adding Users==== | ||
+ | |||
+ | As configured, the LHM comes with one general user account. The account is called **hands-on** and the password is **minimal**. **It is highly recommended that this account be used for the class examples.** Remember you need to be user '' | ||
+ | |||
+ | To add yourself as a user with a different user name, follow these steps. | ||
+ | |||
+ | **Step 1.** As root do the following to create a user and add a password: | ||
+ | |||
+ | < | ||
+ | useradd -G hadoop USERNAME | ||
+ | passwd USERNAME | ||
+ | </ | ||
+ | |||
+ | **Step 2.** These steps change to user hdfs and create the user directory in HDFS (as root) | ||
+ | |||
+ | < | ||
+ | su - hdfs | ||
+ | hdfs dfs -mkdir / | ||
+ | hdfs dfs -chown USERNAME: | ||
+ | exit | ||
+ | </ | ||
+ | |||
+ | **Step 3.** Log out and log in to the new account | ||
+ | |||
+ | ====Web Access==== | ||
+ | |||
+ | The various web interfaces shown in class are available using the following URLs. Enter the desired | ||
+ | URL in your local browser and the VM should respond. | ||
+ | < | ||
+ | HDFS web interface: | ||
+ | YARN Jobs web Interface: | ||
+ | Zeppelin Web Notebook: | ||
+ | </ | ||
+ | |||
+ | The Zeppelin interface is not configured (i.e. it is run in anonymous mode without the need to log in). | ||
+ | The " | ||
+ | |||
+ | The '' | ||
+ | |||
+ | ====Getting Data into Zeppelin==== | ||
+ | |||
+ | If you want to load your own data into a Zeppelin notebook, place the data in the zeppelin account under ''/ | ||
+ | |||
+ | # cp DATA / | ||
+ | # chown zeppelin: | ||
+ | |||
+ | This location is the default path for the Zeppelin interpreter (run '' | ||
+ | |||
+ | ====Database for Sqoop Example==== | ||
+ | |||
+ | MySQL has been installed in the VM. The World database used in the Sqoop example from the class | ||
+ | has been preloaded into MySQL. The SQL login and password for the Sqoop database are **sqoop** and **sqoop**. | ||
+ | |||
+ | ====Log Files==== | ||
+ | |||
+ | There is currently no log file management, and the log directory may fill up and use the sandbox storage. | ||
+ | There is a '' | ||
+ | This script will remove most of the Hadoop/ | ||
+ | |||
+ | ====Stopping and Starting the Hadoop Daemons==== | ||
+ | |||
+ | The Hadoop Daemons are started in the ''/ | ||
+ | is run when the system boots). The actual scripts are in ''/ | ||
+ | simple with no checking. If you are knowledgeable, | ||
+ | for errors and issues. The scripts are run in the following order: | ||
+ | |||
+ | / | ||
+ | / | ||
+ | / | ||
+ | / | ||
+ | / | ||
+ | / | ||
+ | |||
+ | A corresponding "stop script" | ||
+ | |||
+ | As mentioned, if all the scripts are running, the '' | ||
+ | (run as root) should show the following (process numbers will be different). | ||
+ | The RunJar entries are for the '' | ||
+ | |||
+ | # jps | ||
+ | 1938 NetworkServerControl | ||
+ | 2036 ZeppelinServer | ||
+ | 1797 ResourceManager | ||
+ | 1510 NameNode | ||
+ | 1973 RunJar | ||
+ | 1576 SecondaryNameNode | ||
+ | 1882 JobHistoryServer | ||
+ | 1675 DataNode | ||
+ | 1962 RunJar | ||
+ | 1841 NodeManager | ||
+ | 2445 Jps | ||
+ | |||
+ | For HDFS to be running correctly the following daemons need to be running: | ||
+ | |||
+ | NameNode | ||
+ | SecondaryNameNode | ||
+ | DataNode | ||
+ | |||
+ | If one or all are not running, run (as root) | ||
+ | |||
+ | / | ||
+ | / | ||
+ | |||
+ | For YARN to be running correctly the following daemons need to be running: | ||
+ | |||
+ | ResourceManager | ||
+ | JobHistoryServer | ||
+ | NodeManager | ||
+ | |||
+ | If one or all are not running, run (as root) | ||
+ | |||
+ | / | ||
+ | / | ||
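The HDFS and YARN daemon lists above can be checked in one pass by filtering the `jps` output. The following is a minimal sketch, assuming the `jps` output format shown earlier (a process number followed by a name); the function name `lhm_missing_daemons` is hypothetical and is not one of the scripts shipped in the VM.

```shell
# Sketch: list required HDFS/YARN daemons that are absent from jps output.
# lhm_missing_daemons is a hypothetical helper name.

lhm_missing_daemons() {
    # Reads jps output ("PID Name" per line) on stdin.
    # Prints each missing daemon; returns 0 only if none are missing.
    local running name missing=0
    running=$(awk '{print $2}')
    for name in NameNode SecondaryNameNode DataNode \
                ResourceManager JobHistoryServer NodeManager; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$running" | grep -qx "$name"; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}
```

As root in the VM, run `jps | lhm_missing_daemons`. Any name printed belongs to either the HDFS group or the YARN group above, which indicates which pair of stop and start scripts to run.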
+ | |||
+ | A local metadata database (called Derby) is needed for Hive, if | ||
+ | the '' | ||
+ | the derby daemon: | ||
+ | |||
+ | / | ||
+ | / | ||
+ | |||
+ | Spark can use Hive tables through a hive-metastore and hiveserver2 service. To stop and restart the services (in the following order) | ||
+ | |||
+ | / | ||
+ | / | ||
+ | / | ||
+ | / | ||
+ | |||
+ | Finally, if the Zeppelin web page cannot be reached, the Zeppelin daemon | ||
+ | may not be running. Stop and restart the daemon: | ||
+ | |||
+ | / | ||
+ | / | ||
+ | |||
+ | If any or all of the daemons will not start after the above procedure | ||
+ | then there is a bigger issue with the VM. Please contact Eadline | ||
+ | and describe the situation. | ||
+ | |||
+ | When the VM is stopped (see below) with '' | ||
+ | |||
+ | ====Stopping the VM==== | ||
+ | To stop the VM, click on " | ||
+ | select the "Save State" option. The next time the machine starts it will have all the | ||
+ | changes you made. | ||
+ | |||
+ | Alternatively, | ||
+ | |||
+ | # poweroff | ||
+ | |||
+ | And the VM will gracefully shut down the Hadoop/ | ||
+ | |||
+ | |||
+ | ====VM Installation Documentation==== | ||
+ | |||
+ | Please see ''/ | ||
+ | |||
+ | ====Issues/ | ||
+ | |||
+ | These issues have been addressed in the current version of the VM. Please use the latest VM to avoid them. | ||
+ | |||
+ | 1. If you have problems loading the OVA image into VirtualBox, check the MD5 signature of the OVA file. The MD5 signature returned by running the program below should match the signature provided [[https:// | ||
+ | |||
+ | For **Linux** use " | ||
+ | |||
+ | $ md5sum Linux-Hadoop-Minimal-0.42.ova | ||
+ | |||
+ | For **Macintosh** use " | ||
+ | |||
+ | $ md5 Linux-Hadoop-Minimal-0.42.ova | ||
+ | |||
+ | For **Windows 10** (in PowerShell) use " | ||
+ | |||
+ | C: | ||
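On Linux and Macintosh, the comparison in item 1 can be scripted instead of checked by eye. This is a sketch under stated assumptions: the function name `lhm_verify_md5` is hypothetical, and the expected signature must be copied from the download page and passed as the second argument in lowercase hexadecimal.

```shell
# Sketch: compare a file's MD5 signature with an expected value.
# lhm_verify_md5 is a hypothetical helper name.

lhm_verify_md5() {
    # Usage: lhm_verify_md5 FILE EXPECTED_MD5
    # Returns 0 on match, 1 on mismatch, 2 if no MD5 program is found.
    local file=$1 expected=$2 actual
    if command -v md5sum >/dev/null 2>&1; then
        # Linux: md5sum prints "<hash>  <file>"
        actual=$(md5sum "$file" | awk '{print $1}')
    elif command -v md5 >/dev/null 2>&1; then
        # Macintosh: md5 -q prints only the hash
        actual=$(md5 -q "$file")
    else
        echo "no md5sum or md5 program found" >&2
        return 2
    fi
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}
```

For example, `lhm_verify_md5 Linux-Hadoop-Minimal-0.42.ova EXPECTED_MD5`, where `EXPECTED_MD5` stands for the published signature. A mismatch usually means the download is incomplete and should be repeated.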
+ | |||
+ | 2. Either create your own user account as described above or use the existing " | ||
+ | |||
+ | 3. If zip is not installed on your version of the VM, you can install it by entering the following, as root, and a " | ||
+ | |||
+ | # yum install zip | ||
+ | | ||
+ | | ||
+ | | ||
+ | Total download size: 259 k | ||
+ | | ||
+ | Is this ok [y/N]: y | ||
+ | | ||
+ | | ||
+ | | ||
+ | |||
+ | 4. In previous versions there is a permission issue in HDFS that prevents Hive jobs from working. To fix it, perform the following steps: | ||
+ | |||
+ | a) Log in to the VM as root (pw=" | ||
+ | |||
+ | ssh root@127.0.0.1 -p 2222 | ||
+ | |||
+ | b) then change to hdfs user | ||
+ | |||
+ | su - hdfs | ||
+ | |||
+ | c) fix the permission error: | ||
+ | |||
+ | hdfs dfs -chmod o+w / | ||
+ | |||
+ | d) Check the result | ||
+ | |||
+ | hdfs dfs -ls / | ||
+ | |||
+ | e) The output of the previous command should look like: | ||
+ | |||
+ | Found 1 items | ||
+ | | ||
+ | |||
+ | f) Exit out of the hdfs account | ||
+ | |||
+ | | ||
+ | |||
+ | g) Exit out of the root account | ||
- | ====Student Usage:==== | + | exit |
+ | You should now be back at the terminal on your laptop/ | ||