=====Linux Hadoop Minimal VM Notes VERSION 2=====

=== There are now two versions of the LHM (Intel-x86_64 and Apple-M) ===

  * The Intel version is for all Windows, Mac, and Linux systems that use Intel x86_64 based processors.
  * The Arm version is for all Mac systems that use the newer Apple M1 and M2 processors.

=== Intel x86_64 with VirtualBox ===

  * Version: 2.0-8.1
  * Release Date: 25-Jan-2024

=== Apple M1, M2 with UTM ===

  * Version: 2.0-M8.1
  * Release Date: 25-Jan-2024

=== Issues with Either Version of the LHM ===

  * Author: Douglas Eadline
  * Email: ''deadline''(you know what goes here)''limulus-computing''(and here)''com''

**Unless otherwise noted, all course content, notes, and examples are (c) Copyright Limulus Computing, Douglas Eadline 2022,2023. All rights reserved.**

====What Is This?====

The Linux Hadoop Minimal is a virtual machine (VM) that can be used to try the examples presented in many of the academic classes, on-line trainings (mentioned on the main page), and any of Doug Eadline's instructional videos or books. It provides a fully operational Linux environment that runs Apache Hadoop, Spark, Hive, Kafka, NiFi, HBase, and Sqoop. The machine also has many supporting packages installed, while keeping resource usage as low as possible so the VM can be used on most laptops. (See below for resource recommendations.)

To learn more about the on-line courses and my other analytics books and videos, go to:

  * [[https://www.safaribooksonline.com/search/?query=eadline|Safari Books Online]]

PLEASE NOTE: This version of Linux Hadoop Minimal (LHM) is still considered "beta." If you use it and find problems, please send any issues to deadline(you know what goes here)basement-supercomputing.com with "LHM" in the subject line.

====Installation Steps for Intel-x86_64 Based Hosts (with Videos)====

These instructions are for **Intel based (x86_64) Windows, Mac, and Linux machines**.

**Hardware Requirements:** To run the VM you will need an **x86_64 processor** with a MINIMUM of 4+ cores/threads, 4+ GB memory, 70G of disk space, and support for HW virtualization.

There are now installation videos available for Intel based Windows 10 and MacOS. The videos cover installing the LHM into VirtualBox and how to log into the virtual machine using MobaXterm, Putty, or PowerShell on Windows 10 and Terminal on MacOS. (NOTE: The current version of LHM may be higher than what is displayed in the videos.)

  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/video/Extra-Starting-LHM-Mac.html|VIDEO: Installing and using LHM on MacOS]]
  * [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/video/Extra-Starting-LHM-Windows.html|VIDEO: Installing and using LHM on Windows 10]]

**The videos cover the following steps:**

**Step 1:** Download and install VirtualBox for your environment. VirtualBox is freely available. Note: Some Windows environments may need the Extension Pack. See the [[https://www.virtualbox.org|VirtualBox Web Page]].

**Step 2:** Follow the installation instructions for your Operating System environment. For Red Hat based systems this page, https://tecadmin.net/install-oracle-virtualbox-on-centos-redhat-and-fedora, is helpful. With Linux there are some dependencies on kernel versions and modules that need to be addressed.
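On a Linux host, a quick way to confirm the prerequisites is to check that the CPU advertises hardware virtualization and that the VirtualBox kernel module built during the install is loaded. This is a minimal sketch (the module is normally named ''vboxdrv''; a count of 0 from the first command means virtualization is unavailable or disabled in the BIOS):

  # Count CPU flags for hardware virtualization (vmx = Intel VT-x, svm = AMD-V)
  grep -c -E 'vmx|svm' /proc/cpuinfo

  # Confirm the VirtualBox kernel module is loaded
  lsmod | grep vboxdrv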
If you are using **Windows**, you will need an "ssh client." Either of these will work; both are freely available at no cost. (MobaXterm is recommended.)

  * [[http://www.putty.org|Putty]] (provides a terminal for ssh sessions)
  * [[http://mobaxterm.mobatek.net|MobaXterm]] (provides a terminal for ssh sessions and allows remote X Windows sessions)

If you are using a **Mac OS** system and want to run X-Windows applications, you will need to install the [[https://www.xquartz.org/|XQuartz]] X-Windows library package.

**Step 3:** Make sure hardware virtualization is enabled in your BIOS.

**Step 3a:** On Macintosh systems with //Big Sur//, you may get the ''Kernel driver not installed (rc=-1908)'' error. The error is due to new security levels in MacOS. See this [[https://www.howtogeek.com/658047/how-to-fix-virtualboxs-%E2%80%9Ckernel-driver-not-installed-rc-1908-error/|page]] for a fix.

**Step 4:** Download the current Linux-Hadoop-Minimal-V2.0-xxxxx.ova image ([[http://161.35.229.207/download/Linux-Hadoop-Minimal-V2.0-8.1.ova|US]] [[http://134.209.239.225/download/Linux-Hadoop-Minimal-V2.0-8.1.ova|Europe]]) and load it into VirtualBox.

**Step 5:** Start the VM. All the essential Hadoop services should be started automatically.

**HINT:** If your laptop (desktop) has more than 4GB of memory, you can increase the amount of memory for the LHM virtual machine. Before the LHM is started, go to the VirtualBox GUI, select the LHM, then select ''Settings/System''. Use the ''Base Memory'' slider to add more memory to the LHM. Make sure you leave enough for the base operating system (Windows, MacOS, Linux) to run.

**IMPORTANT:** The LHM should be stopped when not in use. The running services should be stopped in a ''graceful manner'' (powered down) because frequent standby or a sudden power interruption can leave some services in a broken state. To stop the LHM, it is suggested that you log in as the root user (see [[linux_hadoop_minimal_installation_instructions_version_2#stopping_the_vm|Stopping the VM]]) and issue the ''poweroff'' command. This will ensure a safe and orderly shutdown of the machine.

====Installation Steps for Apple-M Based Hosts====

**25-Jan-2024 VERSION-UPDATED**

There is a full LHM for the Apple M based machines. The following (aarch64) packages are installed:

  CentOS Linux 7.6, Python3 3.6.8, R 3.6.0, Hadoop 3.3.0, Hive 3.1.2, Apache Spark 2.4.5,
  Derby 10.14.2.0, Zeppelin 0.8.2, Sqoop 1.4.7, Kafka 2.5.0, HBase 2.4.10, NiFi 1.17.0
  (KafkaEsque is not installed)

To run the VM on an Apple-M1 based machine, perform the following steps:

**Step 1:** Download and install [[https://mac.getutm.app/|UTM]] (use the free direct download; the Mac Store version is $10 and helps pay for development). UTM is similar to VirtualBox -- it runs and manages the virtual machines.

**Step 2:** Download the LHM for Apple-M1 into your ''Downloads'' folder. The current version is available here: [[http://161.35.229.207/download/Linux-Hadoop-Minimal-V2.0-M8.1.utm.zip|US]] [[http://134.209.239.225/download/Linux-Hadoop-Minimal-V2.0-M8.1.utm.zip|Europe]] (9G). The [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-V2.0-M8.1.utm.zip.MD5.txt|MD5 file]] can be used to verify the integrity of the download. **NOTE:** Google Chrome may prevent http downloads; right click the link, choose "Save Link As", then click "Keep" next to the blue discard box at the bottom of the browser.

**Step 3:** The downloaded LHM is a zip file. Click to extract the file. It should create a directory called ''Linux-Hadoop-Minimal-V2.0-M8.1.utm''.
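If you prefer the Mac Terminal, the download can also be verified and extracted from the command line. A minimal sketch, assuming the zip file is in your ''Downloads'' folder (compare the printed MD5 value with the one in the MD5 file linked above):

  cd ~/Downloads

  # Compute the MD5 signature of the download
  md5 Linux-Hadoop-Minimal-V2.0-M8.1.utm.zip

  # Extract the VM; this creates the Linux-Hadoop-Minimal-V2.0-M8.1.utm directory
  unzip Linux-Hadoop-Minimal-V2.0-M8.1.utm.zip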
**Step 4:** Start UTM (it will be in the Applications folder; it is suggested you keep it in your dock). Click the ''+'' to the left of ''UTM'' at the top. A window called ''Start'' will open. Toward the bottom, click ''Open'' under ''Existing''. Next, navigate to the ''Downloads'' folder and click on ''Linux-Hadoop-Minimal-V2.0-M8.1.utm''. The LHM should now be listed in the left hand column of the UTM window.

**Step 5:** Start the LHM by clicking on the big arrow in the middle of the window. (If it is not highlighted, click on ''Linux-Hadoop-Minimal-V2.0-beta8-M1-R2'' on the left.) The LHM should start. A new "terminal" window will open and show the VM boot-up sequence. When finished, a prompt will be displayed in this window (''localhost login:''). Minimize this terminal window. We will log into the LHM using the Mac Terminal program.

**Note 1:** If you want to run X-Windows applications you will need to install the [[https://www.xquartz.org/|XQuartz]] X-Windows library package.

**Note 2:** The LHM is configured to use 4 cores and 6G of memory. This should leave enough resources to run other applications on the host. If you are having difficulty running applications while the LHM is running, you can stop the LHM and the 6G of memory and 4 cores it was using will be released.

**IMPORTANT:** The LHM should be stopped when not in use. The running services should be stopped in a ''graceful manner'' (powered down) because frequent standby or a sudden power interruption can leave some services in a broken state. To stop the LHM, it is suggested that you log in as the root user (see [[linux_hadoop_minimal_installation_instructions_version_2#stopping_the_vm|Stopping the VM]]) and issue the ''poweroff'' command. This will ensure a safe and orderly shutdown of the machine.

====Check the LHM from Your Local Machine====

Once the LHM is up and running, use an //ssh client// to connect and verify all the services are running. This procedure is the same for both Intel-x86_64 and Apple-M1 based systems. It is possible to log in and use the sandbox from the VirtualBox or UTM terminal; however, you will have much more flexibility with local terminals. Follow the instructions below for local terminal access.

As described above, on **Windows systems** use Putty or MobaXterm to start a terminal connection. See the videos above for a full explanation of this process.

On **Mac and Linux** systems, open a text terminal (under //Applications/Utilities/Terminal//) and connect to the LHM as the root user using the ''ssh'' command below. Macintosh and Linux machines have ''ssh'' and a Terminal application installed by default.

The root password is: **hadoop** on Intel-x86_64 based systems \\
The root password is: **hadoop2023** on Apple-M1 based systems

  ssh root@127.0.0.1 -p 2222

You should now be in the ''/root'' directory.

To confirm all the Hadoop daemons have started, enter ''jps'' as root. The results should list at least 16 daemons as shown below. (Ignore the Jps entry; process numbers and order will be different.)

  # jps
  2625 QuorumPeerMain
  1666 HistoryServer
  1701 NameNode
  4006 ApplicationHistoryServer
  2571 SecondaryNameNode
  3755 NodeManager
  14412 Jps
  2989 DataNode
  4174 JobHistoryServer
  1295 RunJar
  1296 RunJar
  3473 ResourceManager
  3315 Kafka
  3351 HMaster
  1656 ZeppelinServer
  1306 NetworkServerControl
  3550 HRegionServer
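While still logged in as root, you can also take a quick look at HDFS itself. This is an optional, minimal sketch (the report values will differ on your VM); it switches to the ''hdfs'' user, prints a short health report, and lists the top-level HDFS directories:

  # Become the HDFS superuser (no password is needed from root)
  su - hdfs

  # Show basic HDFS health: capacity, live DataNodes, missing blocks
  hdfs dfsadmin -report

  # List the top-level HDFS directories (e.g., /user)
  hdfs dfs -ls /

  # Return to the root account
  exit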
When finished, exit from the root account.

  # exit

====User Account====

As configured, the LHM comes with one general user account. The account is called **hands-on** and the password is **minimal**.

The login command (to get from your local machine to the LHM as user "hands-on"):

  ssh hands-on@127.0.0.1 -p 2222

**It is highly recommended that this account be used for the class examples.** Remember, you need to be user ''hdfs'' to do any administrative work in HDFS, and running as user ''hdfs'' gives you full ''root'' control of the HDFS file system. The ''hdfs'' account has no active password. To become the ''hdfs'' user, log in as root and issue a ''su - hdfs'' command.

There is also a user account called "**nifi**" that is used by the NiFi tool. The password is "**nifiLHM**".

====Copying Files In and Out of the Virtual Machine====

To copy a file from your LOCAL MACHINE into the VM, use the ''scp'' command. For instance, the following copies the file ''SOURCE-FILE'' from your local directory on your ''LOCAL MACHINE'' to the "**hands-on**" account. The password is "**minimal**" and the command places the file in the ''/home/hands-on'' directory in the VM.

  scp -P2222 SOURCE-FILE hands-on@127.0.0.1:/home/hands-on

To be clear, the above command is run on your ''LOCAL MACHINE''. On Macintosh and Linux systems run this from a terminal. On Windows run it from MobaXterm.

To copy a file from the VM to your ''LOCAL MACHINE'' and place it in your current directory, use the following (don't forget the ''.''):

  scp -P2222 hands-on@127.0.0.1:/home/hands-on/SOURCE-FILE .

To be clear, the above command is run on your ''LOCAL MACHINE''. On Windows, the data will be placed in the MobaXterm "Persistent Home Directory." In the case of Windows 10 with user "Doug" this would be the following:

  C:\Users\Doug\Documents\MobaXterm\home

====General Usage Notes====

1. The Linux Hadoop Minimal VERSION 2 includes the following Anaconda and Apache software:

  CentOS Linux 7.6 minimal
  Anaconda 3: Python 3.7.4
  Apache Hadoop 3.3.0
  Apache Hive 3.1.2
  Apache Spark 2.4.5
  Apache Derby 10.14.2.0
  Apache Zeppelin 0.8.2
  Apache Sqoop 1.4.7
  Apache Kafka 2.5.0
  Apache NiFi 1.17.0
  Apache HBase 2.4.10

Anaconda Python is the default for all users.

2. The Linux Hadoop Minimal has been tested with VirtualBox on Linux, MacOS 10.12-11.3, and Windows 10 Home edition. It has not been tested with VMware.

3. The Linux Hadoop Minimal Virtual Machine is designed to work on minimal hardware. **It is recommended that at a MINIMUM your system have an x86_64 processor with at least 4 cores/threads, 4+ GB memory, and 70G of disk space.** The VM is set to use 3G of system memory. This will cause some applications to swap to disk, but it should allow the virtual machine to run on a 4GB laptop/desktop. (If you are thinking of using the Cloudera/Hortonworks sandbox, then 4+ cores and 16+ GB of memory are recommended.)

4. The above packages have not been fully tested, although all of the examples from the course should work.

====Adding Users====

To add yourself as a user with a different user name, follow these steps. (A combined sketch of all three steps is shown after the list.)

**Step 1.** As root, do the following to create a user and add a password:

  useradd -G hadoop USERNAME
  passwd USERNAME

**Step 2.** These steps change to user hdfs and create the user directory in HDFS (as root):

  su - hdfs
  hdfs dfs -mkdir /user/USERNAME
  hdfs dfs -chown USERNAME:hadoop /user/USERNAME
  exit

**Step 3.** Log out and log in to the new account.
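For convenience, the three steps above can also be run in one pass from the root account. This is a minimal sketch using a hypothetical user name ''alice'' (substitute your own user name):

  # Step 1: create the Linux account and set its password
  useradd -G hadoop alice
  passwd alice

  # Step 2: create the user's HDFS home directory and set its ownership (run as the hdfs user)
  su - hdfs -c "hdfs dfs -mkdir /user/alice"
  su - hdfs -c "hdfs dfs -chown alice:hadoop /user/alice"

  # Confirm the new HDFS directory exists, then log out and log in as the new user
  su - hdfs -c "hdfs dfs -ls /user"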
====Web UI Access====

The various web interfaces shown in class are available using the following URLs. Enter the desired URL in your **local browser** and the VM should respond.

  HDFS web interface:       http://127.0.0.1:50070
  YARN Jobs web interface:  http://127.0.0.1:8088
  Zeppelin Web Notebook:    http://127.0.0.1:9995
  Spark History Server:     http://127.0.0.1:18080
  NiFi:                     http://127.0.0.1:18085

The Zeppelin interface is not configured with a login (i.e., it runs in anonymous mode without the need to log in). The "Zeppelin Tutorial/Basic Features" notebook used in class works, as do some of the ''SparkR'' notebooks.

===Using NiFi===

Due to the large amount of resources used by NiFi, the NiFi web interface is not automatically started when the LHM starts. To start the NiFi web UI, run the following as user root:

  # systemctl start nifi

Be patient, the NiFi UI (located at ''http://127.0.0.1:18085'') may take several minutes to start. To stop the NiFi interface, enter (as root):

  # systemctl stop nifi

It may also be useful to stop some of the services you do not intend to use with NiFi. For instance, stopping HBase and Kafka (''systemctl stop hbase; systemctl stop kafka'') will free up resources and help NiFi run better. Don't forget to stop NiFi and restart these services if you need them.

====Getting Data into Zeppelin====

If you want to load your own data into a Zeppelin notebook, place the data in the zeppelin account under ''/home/zeppelin''. Log in as root to place data in this account, then change the ownership to the zeppelin user, for example:

  # cp DATA /home/zeppelin
  # chown zeppelin:hadoop /home/zeppelin/DATA

This location is the default path for the Zeppelin interpreter (run ''pwd'' in the ''%sh'' interpreter).

====Database for Sqoop Example====

MariaDB (MySQL) has been installed in the VM. The World database used with the Sqoop example in the class has been preloaded into MySQL. The SQL login and password for the Sqoop database are **sqoop** and **sqoop**.

====Log Files====

**The log management in V2.0-beta has not been fully configured.** There is currently no logfile management, and the log directories may fill up and consume the sandbox storage. There is a ''clean-logs.sh'' script in ''/root/Hadoop-Minimal-Install-Notes/Hadoop-Hive/scripts''. This script will remove most of the Hadoop/Spark and system logs (it is somewhat aggressive).

====Stopping and Starting the Hadoop Daemons====

The Hadoop daemons are started and stopped using systemd.

  * hadoop.service - Starts HDFS (NameNode, SNN, and DataNode) and YARN (ResourceManager, NodeManager, TimelineServer)
  * derby.service - Used to store Hive metadata
  * hive-metastore.service - Manages Hive metadata
  * hiveserver2.service - Provides access to Hive tables (for Spark)
  * kafka.service - Starts the Kafka server
  * zeppelin.service - Starts the Zeppelin WebUI
  * spark-history.service - Web UI for Spark history
  * hbase.service - Starts the HBase database
  * nifi.service - Starts the NiFi WebUI

All services (except ''nifi.service'', as noted above) are started when the LHM starts. Each service can be started, stopped, and checked using ''systemctl''. For example, to start, stop, and check the status of the Hadoop service use:

  # systemctl start hadoop
  # systemctl status hadoop
  # systemctl stop hadoop
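To check all of the LHM services at once, a simple shell loop over the service names listed above can be used. This is a minimal sketch (run as root on the LHM); ''systemctl is-active'' prints ''active'' for a running service, and ''nifi'' will normally report ''inactive'' unless you have started it:

  # Report the state of each LHM service in one pass
  for svc in hadoop derby hive-metastore hiveserver2 kafka zeppelin spark-history hbase nifi; do
      echo -n "$svc: "
      systemctl is-active $svc
  done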
As mentioned, if all of the services are running, the ''jps'' command (run as root) should show the following (process numbers will be different). The RunJar entries are for the ''hiveserver2'' and ''hive-metastore'' processes.

  2625 QuorumPeerMain
  1666 HistoryServer
  1701 NameNode
  4006 ApplicationHistoryServer
  2571 SecondaryNameNode
  3755 NodeManager
  14412 Jps
  2989 DataNode
  4174 JobHistoryServer
  1295 RunJar
  1296 RunJar
  3473 ResourceManager
  3315 Kafka
  3351 HMaster
  1656 ZeppelinServer
  1306 NetworkServerControl
  3550 HRegionServer

A local metadata database (called Derby) is needed for Hive. If the ''NetworkServerControl'' daemon is not running, stop and restart the derby daemon:

  # systemctl restart derby

Spark can use Hive tables through the hive-metastore and hiveserver2 services. To stop and restart the services (in the following order):

  # systemctl restart hive-metastore
  # systemctl restart hiveserver2

Finally, if the Zeppelin web page cannot be reached, the Zeppelin daemon may not be running. Stop and restart the daemon:

  # systemctl restart zeppelin

If any or all of the daemons will not start after the above procedure, then there is a bigger issue with the VM. Please contact Eadline and describe the situation.

The scripts used to stop and start the services are located in ''/opt/services''. Under normal operation, these scripts should not have to be run "by hand."

====Stopping the VM====

**VirtualBox**

To stop the LHM, click on "Machine" in the VirtualBox menu bar. Select "Close" and then select the "Save State" option. The next time the machine starts it will have all the changes you made.

**VirtualBox and UTM** (recommended)

Alternatively, you can shut down the LHM from within the LHM (switch to the root user) using the following command:

  # poweroff

The VM will gracefully shut down all the services and preserve any changes you made.

====VM Installation Documentation====

Please see the ''/root/Hadoop-Minimal-Install-Notes'' directory in the VM for how the packages were installed.

====Issues/Bugs====

These issues have been addressed in the current version of the VM. Please use the latest VM and you can avoid these issues.

1. Excessive ERROR and INFO messages with ''hive''. In LHM version V2 Beta-8 (Aug-08-2022) there is an error in the ''/opt/apache-hive-3.1.2-bin/conf/hive-env.sh'' file. To fix the error, run the following commands as **root**:

  /bin/cp /opt/apache-hive-3.1.2-bin/conf/hive-env.sh /opt/apache-hive-3.1.2-bin/conf/hive-env.sh.orig
  grep -v hadoop-3.3.0 /opt/apache-hive-3.1.2-bin/conf/hive-env.sh >/opt/apache-hive-3.1.2-bin/conf/hive-env.sh.fix1
  /bin/cp /opt/apache-hive-3.1.2-bin/conf/hive-env.sh.fix1 /opt/apache-hive-3.1.2-bin/conf/hive-env.sh
  chown hive:hadoop /opt/apache-hive-3.1.2-bin/conf/*

2. If you have **problems loading the OVA image into VirtualBox**, check the MD5 signature of the OVA file. The MD5 signature returned by running the program below should match the signature provided [[https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-V2.0-beta.MD5.txt|here]]. For each OS, use the following commands (note the name of the OVA file may be different):

For **Linux** use "md5sum":

  $ md5sum Linux-Hadoop-Minimal-V2.0-beta1.ova

For **Macintosh** use "md5":

  $ md5 Linux-Hadoop-Minimal-V2.0-beta1.ova

For **Windows 10** (in PowerShell) use "Get-FileHash" (also note the use of uppercase):

  C:\Users\Doug> Get-FileHash .\Linux-Hadoop-Minimal-V2.0-beta1.ova -Algorithm MD5

3. If the **time on the LHM falls out of sync** with the host due to hibernation, the following commands can be run to reset the ntpd time daemon (run as root). **NOTE:** the host must have Internet access.
  systemctl -l stop ntpd
  ntpdate -u pool.ntp.org
  systemctl -l start ntpd

To check on the current state of time synchronization, run ''ntpq -pn'' to list the current "peers" that supply time. A "*" means the LHM is actively synchronized with that peer. No "*" means it is not synchronized. If the difference between the system time and the peer time is too great, the ntpd daemon may stop (or never reach synchronization). If the ntpd daemon will not synchronize, perform the steps above. Example output from ''ntpq -pn'' for a time synchronized LHM is as follows:

       remote           refid      st t when poll reach   delay   offset  jitter
  ==============================================================================
  *108.61.73.243   209.51.161.238   2 u   68   64  377   14.452    1.052   0.291
  +129.250.35.250  204.2.140.74     2 u    2   64  377   13.251    0.048   0.885
  +138.236.128.36  69.89.207.199    2 u   12   64  377   50.255    1.710   1.378
  -162.159.200.123 10.106.8.9       3 u    1   64  377   19.879    0.160   0.368

4. **Do not use the root account to run examples.** Either create your own user account as described above or use the existing "hands-on" user account. The examples will not work if run as the root account.

5. **In old versions there is a permission issue in HDFS** that prevents Hive jobs from working. To fix it, perform the following steps:

a) Log in to the VM as root (pw="hadoop"):

  ssh root@127.0.0.1 -p 2222

b) Then change to the hdfs user:

  su - hdfs

c) Fix the permission error:

  hdfs dfs -chmod o+w /user/hive/warehouse

d) Check the result:

  hdfs dfs -ls /user/hive

e) The output of the previous command should look like:

  Found 1 items
  drwxrwxrwx   - hive hadoop          0 2019-01-24 20:43 /user/hive/warehouse

f) Exit out of the hdfs account:

  exit

g) Exit out of the root account:

  exit

You should now be back at the terminal on your laptop/desktop.