Linux Hadoop Minimal Notes
Version .42
Date: June 3, 2019
Author: Douglas Eadline (Email: deadline(you know what goes here)basement-supercomputing.com)

Unless otherwise noted, all course content, notes, and examples are
Copyright Basement Supercomputing 2019, All rights reserved.

What Is This?
=============

The Linux Hadoop Minimal is a virtual machine (VM) that can be used to try
the examples presented in the two on-line courses entitled:

  "Hands-on Introduction to Apache Hadoop and Spark Programming"
  "Practical Linux Command Line for Data Engineers and Analysts"

It can also be used for the examples provided in the companion on-line
video tutorial (14+ hours):

  "Hadoop and Spark Fundamentals: LiveLessons"

The machine has many important Hadoop and Spark packages installed and at
the same time tries to keep resource usage as low as possible so the VM can
be used on most laptops. (See below for resource recommendations.)

To learn more about the courses and my other analytics books and videos, go to:

  https://www.safaribooksonline.com/search/?query=eadline

PLEASE NOTE: This version of Linux Hadoop Minimal (LHM) is still considered
"beta." If you use it and find problems, please send any issues to
deadline@eadline.org with "LHM" in the subject line.

Student Usage:
==============

If you have taken the "Hands-on" course mentioned above, you can download
the NOTES.txt files, examples, and data archive directly to the VM using
"wget." The archive is available in both compressed tar (tgz) and Zip (zip)
format. It is recommended that you either make a new user account or use
the "hands-on" account for the archive (and run most of the examples from
this account).

For instance, to download and extract the archive for the "Hands-on" course
from within the VM:

  wget https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Hands_On_Hadoop_Spark-V1.5.tgz
  tar xvzf Hands_On_Hadoop_Spark-V1.5.tgz

Similarly, for the "Linux Command Line" course (do this within the VM):

  wget https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Command-Line-V1.0.tgz
  tar xvzf Linux-Command-Line-V1.0.tgz

If you want to move files from your local machine to the VM, you can use
"scp" on your host. (scp is natively available on Linux and Macintosh
systems; on Windows it is part of the MobaXterm package.)

  scp -P2222 SOURCE-FILE USERNAME@127.0.0.1:PATH

USERNAME is a valid account on the VM. There is a user account called
"hands-on" that can be used for most of the examples. Therefore, the
command to copy a file (SOURCE-FILE) from your host system to the VM is:

  scp -P2222 SOURCE-FILE hands-on@127.0.0.1:/home/hands-on

See the "Connect From Your Local Machine to the LHM Sandbox" section below
for more information on using ssh and scp.

USAGE NOTES:
============

1. The Linux Hadoop Minimal includes the following software:

     CentOS Linux 6.9 minimal
     Apache Hadoop 2.8.1
     Apache Pig 0.17.0
     Apache Hive 2.3.2
     Apache Spark 1.6.3
     Apache Derby 10.13.1.1
     Apache Zeppelin 0.7.3
     Apache Sqoop 1.4.7
     Apache Flume 1.8.0

   Spark 1.6.3 is installed because later versions need Python 2.7+
   (not available in CentOS 6.9).

2. The Linux Hadoop Minimal has been tested with VirtualBox on Linux,
   MacOS 10.12, and Windows 10 Home edition. It has not been tested with
   VMware.

3. The Linux Hadoop Minimal Virtual Machine is designed to work on minimal
   hardware. It is recommended that, at a MINIMUM, your system have 2 cores,
   4 GB of memory, and 70 GB of disk space. The VM is set to use 2.5 GB of
   memory. This will cause some applications to swap to disk, but it should
   allow the virtual machine to run on a 4 GB laptop/desktop. (If you are
   thinking of using the Hortonworks sandbox instead, 4+ cores and 16+ GB
   of memory are recommended.)

4. The above packages have not been fully tested, although all of the
   examples from the courses work.
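If you want to confirm which versions are installed in your copy of the VM,
the standard version commands for each package can be run from any account
(this assumes the commands are on your PATH inside the VM):

  hadoop version
  pig --version
  hive --version
  spark-submit --version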
Installation Steps:
-------------------

1. Download and install VirtualBox for your environment. VirtualBox is
   freely available. Note: Some Windows environments may need the
   Extension Pack.

     https://www.virtualbox.org

2. Follow the installation instructions for your Operating System
   environment. For Red Hat based systems this page,
   https://tecadmin.net/install-oracle-virtualbox-on-centos-redhat-and-fedora,
   is helpful. With Linux there are some dependencies on kernel versions
   and modules that need to be addressed.

   If you are using Windows, you will need an "ssh client." Either of
   these will work; both are freely available at no cost. (MobaXterm is
   recommended.)

     a) Putty: http://www.putty.org (provides a terminal for ssh sessions)
     b) MobaXterm: http://mobaxterm.mobatek.net (provides a terminal for
        ssh sessions and allows remote X Windows sessions)

3. Make sure hardware virtualization is enabled in your BIOS.

4. Download the
   https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-0.41.ova
   image and load it into VirtualBox. (NOTE: a newer version may be
   available.)

5. Start the VM. All the essential Hadoop services should be started
   automatically.

Connect From Your Local Machine to the LHM Sandbox:
---------------------------------------------------

It is possible to log in and use the sandbox from the VirtualBox terminal;
however, you will have much more flexibility with local terminals. Follow
the instructions below for local terminal access.

As a test, open a text terminal and connect to the sandbox as the root user
with ssh. Macintosh and Linux machines have ssh and a terminal installed;
for Windows see above (Putty or MobaXterm) or this document:

  https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/DOS-Linux-HDFS-cheatsheet.pdf

The root password is: hadoop

  ssh root@127.0.0.1 -p 2222

You should now be in the /root directory.

To confirm all the Hadoop daemons have started, enter "jps" as root. The
results should list the 10 daemons as shown below. (Process numbers will
be different.)

  # jps
  1938 NetworkServerControl
  2036 ZeppelinServer
  1797 ResourceManager
  1510 NameNode
  1973 RunJar
  1576 SecondaryNameNode
  1882 JobHistoryServer
  1675 DataNode
  1962 RunJar
  1841 NodeManager
  2445 Jps

Copying Files Into and Out of the Virtual Machine (VM)
------------------------------------------------------

To copy a file from your LOCAL MACHINE into the VM, use the "scp" command.
For instance, the following copies the file "SOURCE-FILE" from your local
directory on your LOCAL MACHINE to the "hands-on" account. The password is
"minimal" and the command places the file in the /home/hands-on directory
in the VM.

  scp -P2222 SOURCE-FILE hands-on@127.0.0.1:/home/hands-on

To be clear, the above command is run on your LOCAL MACHINE. On Macintosh
and Linux systems run this from a terminal. On Windows run it from
MobaXterm.

To copy a file from the VM to your LOCAL MACHINE and place it in your
current directory, use the following (don't forget the "."):

  scp -P2222 hands-on@127.0.0.1:/home/hands-on/SOURCE-FILE .

To be clear, the above command is run on your LOCAL MACHINE. On Windows,
the data will be placed in the MobaXterm "Persistent Home Directory."
In the case of Windows 10 with user "Doug" this would be the following:

  C:\Users\Doug\Documents\MobaXterm\home
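The same scp syntax works for entire directories if you add the "-r"
(recursive) flag. For example, to copy a directory in either direction
(the directory names below are only placeholders):

  scp -P2222 -r LOCAL-DIR hands-on@127.0.0.1:/home/hands-on
  scp -P2222 -r hands-on@127.0.0.1:/home/hands-on/VM-DIR .

As before, both commands are run on your LOCAL MACHINE.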
Adding Users:
-------------

As configured, the LHM comes with one general user account. The account is
called "hands-on" and the password is "minimal". You can run everything
under this account (but remember you need to be user "hdfs" to do any
administrative work in HDFS). The hdfs account has no password. To become
the hdfs user, log in as root and issue a "su - hdfs" command.

Warning: Running as user "hdfs" gives you full "root" control of the HDFS
file system.

To add yourself as a user:

a) As root, do the following to create a user and add a password:

     useradd -G hadoop USERNAME
     passwd USERNAME

b) These steps change to user hdfs and create the user directory in HDFS
   (as root):

     su - hdfs
     hdfs dfs -mkdir /user/USERNAME
     hdfs dfs -chown USERNAME:hadoop /user/USERNAME
     exit

c) Log out and log in to the new account.

Web Access:
-----------

The various web interfaces shown in class are available using the following
URLs. Enter the desired URL in your local browser and the VM should respond.

  HDFS web interface:       http://127.0.0.1:50070
  YARN Jobs web interface:  http://127.0.0.1:8088
  Zeppelin Web Notebook:    http://127.0.0.1:9995

The Zeppelin interface is not configured (i.e., it runs in anonymous mode
without the need to log in). The "Zeppelin Tutorial/Basic Features" notebook
used in class works, as do some of the SparkR notebooks. The "PySpark
Example" that was demonstrated in class also works. Also, the "md" and "sh"
interpreters have been tested and work.

Loading Data into Zeppelin:
---------------------------

If you want to load your own data into a Zeppelin notebook, place the data
in the zeppelin account under /home/zeppelin. Log in as root to place data
in this account, then change the ownership to the zeppelin user, for
example:

  # cp DATA /home/zeppelin
  # chown zeppelin:hadoop /home/zeppelin/DATA

This location is the default path for the Zeppelin interpreter (run "pwd"
in the %sh interpreter).

World Database for Sqoop Example:
---------------------------------

MySQL has been installed in the VM. The World database used in the Sqoop
example from the class has been preloaded into MySQL.

Log Files:
----------

There is currently no logfile management, and the log directories may fill
up and use all of the sandbox storage. There is a clean-logs.sh script in:

  /root/Hadoop-Minimal-Install-Notes/Hadoop-Pig-Hive/scripts/

This script will remove most of the Hadoop/Spark and system logs (it is
somewhat aggressive).
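For example, to see how much space a clean-up recovers, you can check the
free disk space before and after running the script (as root from within
the VM; if the script is not marked executable, prefix it with "bash"):

  df -h /
  /root/Hadoop-Minimal-Install-Notes/Hadoop-Pig-Hive/scripts/clean-logs.sh
  df -h /

The "df -h /" commands are only a before/after comparison and are not
required.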
Stopping and Starting the Hadoop Daemons:
-----------------------------------------

The Hadoop daemons are started in the /etc/rc.local file (the last script
file that is run when the system boots). The actual scripts are in /usr/sbin
and are very simple, with no checking. If you are knowledgeable, you can
check /var/log/boot.log for errors and issues. The scripts are run in the
following order:

  /usr/sbin/start-hdfs.sh
  /usr/sbin/start-yarn.sh
  /usr/sbin/start-derby.sh
  /usr/sbin/start-hive-metastore.sh
  /usr/sbin/start-hiveserver2.sh
  /usr/sbin/start-zeppelin.sh

A corresponding "stop script" is run when the system is shut down or
rebooted.

As mentioned, if all the scripts are running, the "jps" command (run as
root) should show the following (process numbers will be different). The
RunJar entries are the hiveserver2 and hive-metastore processes.

  # jps
  1938 NetworkServerControl
  2036 ZeppelinServer
  1797 ResourceManager
  1510 NameNode
  1973 RunJar
  1576 SecondaryNameNode
  1882 JobHistoryServer
  1675 DataNode
  1962 RunJar
  1841 NodeManager
  2445 Jps

For HDFS to be running correctly, the following daemons need to be running:

  NameNode
  SecondaryNameNode
  DataNode

If one or all are not running, run (as root):

  /usr/sbin/stop-hdfs.sh
  /usr/sbin/start-hdfs.sh

For YARN to be running correctly, the following daemons need to be running:

  ResourceManager
  JobHistoryServer
  NodeManager

If one or all are not running, run (as root):

  /usr/sbin/stop-yarn.sh
  /usr/sbin/start-yarn.sh

A local metadata database (called Derby) is needed for Hive. If the
"NetworkServerControl" daemon is not running, then stop and restart the
Derby daemon:

  /usr/sbin/stop-derby.sh
  /usr/sbin/start-derby.sh

So that Spark can use Hive tables, the hive-metastore and hiveserver2
services are needed. To stop and restart these services (in the following
order):

  /usr/sbin/stop-hiveserver2.sh
  /usr/sbin/stop-hive-metastore.sh
  /usr/sbin/start-hive-metastore.sh
  /usr/sbin/start-hiveserver2.sh

Finally, if the Zeppelin web page cannot be reached, the Zeppelin daemon may
not be running. Stop and restart the daemon:

  /usr/sbin/stop-zeppelin.sh
  /usr/sbin/start-zeppelin.sh

If any or all of the daemons will not start after the above procedure, then
there is a bigger issue with the VM. Please contact Eadline and describe the
situation.

Stopping the VM
---------------

To stop the VM, click on "Machine" in the VirtualBox menu bar. Select
"Close" and then select the "Save State" option. The next time the machine
starts it will have all the changes you made.

Alternatively, you can enter (as the root user) from within the VM:

  # poweroff

and the VM will gracefully shut down the Hadoop/Spark services and preserve
any changes you made.

VM Installation Documentation
-----------------------------

Please see the /root/Hadoop-Minimal-Install-Notes directory for how the
packages were installed.

Issues/Bugs
-----------

1. Either create your own user account as described above or use the
   existing "hands-on" user account. The examples will not work if run as
   the root account.

2. If zip is not installed on your version of the VM, you can install it by
   entering the following as root and answering "y" when asked. Zip will
   then be installed and available for use.

     # yum install zip
     Loaded plugins: fastestmirror
     Setting up Install Process
     [...TEXT ...]
     Total download size: 259 k
     Installed size: 804 k
     Is this ok [y/N]: y
     [...TEXT ...]
     Installed:
       zip.x86_64 0:3.0-1.el6_7.1

3. In previous versions there is a permission issue in HDFS that prevents
   Hive jobs from working. To fix it, perform the following steps:

   a) Log in to the VM as root (pw="hadoop"):

        ssh root@127.0.0.1 -p 2222

   b) Then change to the hdfs user:

        su - hdfs

   c) Fix the permission error:

        hdfs dfs -chmod o+w /user/hive/warehouse

   d) Check the result:

        hdfs dfs -ls /user/hive

   e) The output of the previous command should look like:

        Found 1 items
        drwxrwxrwx   - hive hadoop   0 2019-01-24 20:43 /user/hive/warehouse

   f) Exit out of the hdfs account:

        exit

   g) Exit out of the root account:

        exit

   You should now be back at the terminal on your laptop/desktop.
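Quick HDFS Sanity Check:
------------------------

As a quick sanity check, you can confirm that HDFS is usable from the
"hands-on" account by round-tripping a small file. The "smoke-test"
directory name is arbitrary, and the commands assume the account has an
HDFS home directory (/user/hands-on); if it does not, create one as shown
in the "Adding Users" section above.

  hdfs dfs -mkdir -p smoke-test
  hdfs dfs -put /etc/hosts smoke-test/
  hdfs dfs -cat smoke-test/hosts
  hdfs dfs -rm -r -skipTrash smoke-test

If the "-cat" step prints the contents of /etc/hosts, HDFS reads and writes
are working for your account.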