Beta Version is posted. Basic functionality has been tested.
Date: Jan 26, 2021
Author: Douglas Eadline  Email: deadline(you know what goes here)basement-supercomputing.com
Hardware Requirements: To run the VM you will need an x86_64 processor with a MINIMUM of 4 cores/threads, 4 GB of memory, 70 GB of disk space, and support for HW virtualization.
Unless otherwise noted, all course content, notes, and examples are © Copyright Limulus Computing, Douglas Eadline 2021, All rights reserved.
The Linux Hadoop Minimal is a virtual machine (VM) that can be used to try the examples presented in the following on-line courses:
It can also be used for the examples provided in the companion on-line video tutorial (14+ hours):
The machine has many important Hadoop and Spark packages installed while keeping resource usage as low as possible so the VM can be used on most laptops. (See below for resource recommendations.)
To learn more about the course and my other analytics books and videos, go to:
PLEASE NOTE: This version of Linux Hadoop Minimal (LHM) is still considered “beta.” If you use it and find problems, please send any issues to deadline(you know what goes here)basement-supercomputing.com with “LHM” in the subject line.
If you have taken the “Hands-on” course mentioned above, you can download the NOTES.txt files, examples, and data archive directly to the VM using wget. The archive is available in both compressed tar (tgz) and Zip (zip) formats. It is recommended that you either make a new user account or use the “hands-on” account for the archive (and run most of the examples from this account).
For instance, to download and extract the archive for the “Hands-on” course from within the VM:
wget --no-check-certificate https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Hands_On_Hadoop_Spark-V1.5.1.tgz
tar xvzf Hands_On_Hadoop_Spark-V1.5.1.tgz
Similarly, for the “Linux Command Line” course (do this within the VM):
wget --no-check-certificate https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Command-Line-V1.0.tgz
tar xvzf Linux-Command-Line-V1.0.tgz
If you want to move files from your local machine to the VM, you can use scp on your host. (scp is natively available on Linux and Macintosh systems; on Windows it is part of the MobaXterm package.)
scp -P2222 SOURCE-FILE USERNAME@127.0.0.1:PATH
USERNAME is a valid account on the VM. There is a user account called hands-on that can be used for most of the examples. Therefore, the command to copy a file (SOURCE-FILE) from your host system to the VM is as follows (it places the file in /home/hands-on in the VM):
scp -P2222 SOURCE-FILE hands-on@127.0.0.1:/home/hands-on
See the Connect From Your Local Machine to the LHM Sandbox section below for more information.
The Linux Hadoop Minimal VERSION 2 includes the following software.
CentOS Linux 7.6 minimal
Anaconda 3: Python 3.7.4
Apache Hadoop 3.3.0
Apache Hive 3.1.2
Apache Spark 2.4.5
Apache Derby 10.14.2.0
Apache Zeppelin 0.8.2
Apache Sqoop 1.4.7
Apache Kafka 2.5.0
1. Anaconda Python is the default for all users.
2. The Linux Hadoop Minimal has been tested with VirtualBox on Linux, MacOS 10.12, and Windows 10 Home edition. It has not been tested with VMware.
3. The Linux Hadoop Minimal Virtual Machine is designed to work on minimal hardware. It is recommended that, at a MINIMUM, your system have an x86_64 processor with at least 4 cores/threads, 4 GB of memory, and 70 GB of disk space. The VM is set to use 3 GB of system memory. This will cause some applications to swap to disk, but it should allow the virtual machine to run on a 4 GB laptop/desktop. (If you are thinking of using the Cloudera/Hortonworks sandbox, then 4+ cores and 16+ GB of memory are recommended.)
4. The above packages have not been fully tested, although all of the examples from the course should work.
Step 1: Download and install VirtualBox for your environment. VirtualBox is freely available. Note: Some Windows environments may need the Extension Pack. See the VirtualBox Web Page.
Step 2: Follow the installation instructions for your Operating System environment. For Red Hat based systems, this page, https://tecadmin.net/install-oracle-virtualbox-on-centos-redhat-and-fedora, is helpful. With Linux there are some dependencies on kernel versions and modules that need to be addressed.
If you are using Windows, you will need an “ssh client.” Either PuTTY or MobaXterm will work, and both are freely available. (MobaXterm is recommended.)
Step 3: Make sure hardware virtualization is enabled in your BIOS.
Step 3a: On Mac Systems with Big Sur, you may get the Kernel driver not installed (rc=-1908) error. The error is due to new security levels in MacOS. See this page for a fix.
Step 4: Download the https://www.clustermonkey.net/download/Hands-on_Hadoop_Spark/Linux-Hadoop-Minimal-V2.0-beta6.ova image and load it into VirtualBox. (NOTE: a newer version may be available.)
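If you prefer the command line to the VirtualBox GUI, the OVA can also be imported with VBoxManage (a minimal sketch, run on your host; adjust the file name to the version you actually downloaded):
VBoxManage import Linux-Hadoop-Minimal-V2.0-beta6.ova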
Step 5: Start the VM. All the essential Hadoop services should be started automatically.
HINT: If your laptop (desktop) has more than 4 GB of memory, you can increase the amount of memory for the LHM virtual machine. Before the LHM is started, go to the VirtualBox GUI, select the LHM, then select Settings/System. Use the Base Memory slider to add more memory to the LHM. Make sure you leave enough for the base operating system (Windows, MacOS, Linux) to run.
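The same change can be made from the host command line with VBoxManage while the VM is powered off. A sketch, assuming the VM is named “Linux-Hadoop-Minimal” (run the list command first to see the actual name):
VBoxManage list vms
VBoxManage modifyvm "Linux-Hadoop-Minimal" --memory 4096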
It is possible to log in and use the sandbox from the VirtualBox terminal; however, you will have much more flexibility with local terminals. Follow the instructions below for local terminal access.
As a test, open a text terminal and connect to the sandbox as the root user with ssh. Macintosh and Linux machines have ssh and a terminal installed; for Windows see above (PuTTY or MobaXterm) or this document:
The root password is: hadoop
ssh root@127.0.0.1 -p 2222
You should now be logged in as root on the VM.
To confirm all the Hadoop daemons have started, enter jps as root. The results should list the daemons shown below. (Ignore the Jps entry; process numbers and order will be different.)
# jps
2245 QuorumPeerMain
3238 JobHistoryServer
3048 ApplicationHistoryServer
2153 SecondaryNameNode
2473 DataNode
2826 ResourceManager
2956 NodeManager
1165 RunJar
1294 ZeppelinServer
1167 NetworkServerControl
1168 RunJar
2873 Kafka
3772 Jps
1502 NameNode
To copy a file from your LOCAL MACHINE into the VM, use the scp command. For instance, the following copies the file SOURCE-FILE from your local directory on your LOCAL MACHINE to the “hands-on” account. The password is “minimal” and the command places the file in the /home/hands-on directory in the VM.
scp -P2222 SOURCE-FILE hands-on@127.0.0.1:/home/hands-on
To be clear, the above command is run on your LOCAL MACHINE. On Macintosh and Linux systems run this from a terminal. On Windows run it from MobaXterm.
To copy a file from the VM to your LOCAL MACHINE and place it in your current directory, use the following. (Don't forget the trailing “.”)
scp -P2222 hands-on@127.0.0.1:/home/hands-on/SOURCE-FILE .
To be clear, the above command is run on your LOCAL MACHINE.
On Windows, the data will be placed in the MobaXterm “Persistent Home Directory.” In the case of Windows 10 with user “Doug” this would be the following:
As configured, the LHM comes with one general user account. The account is called hands-on and the password is minimal.
The login command (to get from your local machine to the LHM as user “hands-on”) is:
ssh hands-on@127.0.0.1 -p 2222
It is highly recommended that this account be used for the class examples. Remember you need to be user hdfs to do any administrative work in HDFS, and running as user hdfs gives you full root control of the HDFS file system. The hdfs account has no active password. To become the hdfs user, log in as root and issue a su - hdfs command.
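For example, a quick way to confirm you have HDFS administrative rights after becoming the hdfs user is the dfsadmin report, which lists the configured capacity and the single LHM DataNode:
hdfs dfsadmin -report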
To add yourself as a user with a different user name, follow these steps.
Step 1. As root do the following to create a user and add a password:
useradd -G hadoop USERNAME
passwd USERNAME
Step 2. These steps change to user hdfs and create the user directory in HDFS (run as root):
su - hdfs
hdfs dfs -mkdir /user/USERNAME
hdfs dfs -chown USERNAME:hadoop /user/USERNAME
exit
Step 3. Log out and log in to the new account.
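As a quick check that Step 2 worked, list the HDFS /user directory; the new account should appear with owner USERNAME and group hadoop:
hdfs dfs -ls /user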
The various web interfaces shown in class are available using the following URLs. Enter the desired URL in your local browser and the VM should respond.
HDFS web interface: http://127.0.0.1:50070
YARN Jobs web interface: http://127.0.0.1:8088
Zeppelin Web Notebook: http://127.0.0.1:9995
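If a page does not load, you can test whether the VM is answering without a browser by using curl on your LOCAL MACHINE (assuming curl is installed); a healthy interface returns HTTP status 200:
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:50070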
The Zeppelin interface is not configured with a login (i.e., it runs in anonymous mode without the need to log in). The “Zeppelin Tutorial/Basic Features” notebook used in class works, as do some of the other tutorial notebooks.
If you want to load your own data into a Zeppelin notebook, place the data in the zeppelin account under /home/zeppelin. Log in as root to place data in this account, then change the ownership to the zeppelin user, for example:
# cp DATA /home/zeppelin
# chown zeppelin:hadoop /home/zeppelin/DATA
This location is the default path for the Zeppelin interpreter (run pwd in a notebook paragraph to confirm).
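As a quick check, shell commands can be run from a notebook paragraph, assuming the shell (%sh) interpreter is enabled (it is in a default Zeppelin install):
%sh
pwd
ls -l /home/zeppelin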
MariaDB (MySQL) has been installed in the VM. The World database used with the Sqoop example in the class has been preloaded into MySQL. The SQL login and password for the Sqoop database are sqoop and sqoop.
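As a quick check of the login from within the VM (the database name world is an assumption here; SHOW DATABASES will list what is actually loaded):
mysql -u sqoop -psqoop -e "SHOW DATABASES;"
mysql -u sqoop -psqoop -e "SHOW TABLES;" world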
The log management in V2.0-beta has not been fully configured. There is currently no logfile management, and the log directory may fill up and consume the sandbox storage.
There is a clean-logs.sh script in the VM. This script will remove most of the Hadoop/Spark and system logs (it is somewhat aggressive).
The Hadoop Daemons are started and stopped using systemd.
All services are started when the LHM starts. Each service can be started, stopped, and checked using systemctl. For example, to start, check the status of, and stop the Hadoop service, use:
# systemctl start hadoop
# systemctl status hadoop
# systemctl stop hadoop
As mentioned, if all the services are running, the jps command (run as root) should show the following (process numbers will be different). The RunJar entries are for the Hive services.
2114 QuorumPeerMain
2248 DataNode
1032 NetworkServerControl
1161 ZeppelinServer
1036 RunJar
1037 RunJar
2800 NodeManager
3924 Jps
2613 ResourceManager
2808 Kafka
1373 NameNode
1981 SecondaryNameNode
2909 ApplicationHistoryServer
A local metadata database (called Derby) is needed for Hive. If the NetworkServerControl daemon is not running, then stop and restart the Derby daemon:
# systemctl restart derby
Spark can use Hive tables through the hive-metastore and hiveserver2 services. To stop and restart the services (in the following order):
# systemctl restart hive-metastore
# systemctl restart hiveserver2
Finally, if the Zeppelin web page cannot be reached, the Zeppelin daemon may not be running. Stop and restart the daemon:
# systemctl restart zeppelin
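Since systemctl status accepts multiple unit names, all of the services named above can be checked in one shot (run as root):
# systemctl status hadoop derby hive-metastore hiveserver2 zeppelin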
If any or all of the daemons will not start after the above procedure, then there is a bigger issue with the VM. Please contact Eadline and describe the situation.
The scripts used to stop and start the services are located in /opt/services. Under normal operation, these scripts should not have to be run “by hand.”
To stop the VM, click on “machine” in the VirtualBox menu bar. Select “Close” and then select the “Save State” option. The next time the machine starts it will have all the changes you made.
Alternatively, you can enter (as root user) from within the VM:
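For example, the standard CentOS 7 power-off command:
shutdown -h now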
The VM will then gracefully shut down the Hadoop/Spark services and preserve any changes you made.
See the /root/Hadoop-Minimal-Install-Notes directory in the VM for notes on how the packages were installed.
These issues have been addressed in the current version of the VM. Please use the latest VM to avoid these issues.
1. If you have problems loading the OVA image into VirtualBox, check the MD5 signature of the OVA file. The MD5 signature returned by running the appropriate command below should match the signature provided here. For each OS, use the following commands (note the name of the OVA file may be different):
For Linux use “md5sum”
$ md5sum Linux-Hadoop-Minimal-V2.0-beta1.ova
For Macintosh use “md5”
$ md5 Linux-Hadoop-Minimal-V2.0-beta1.ova
For Windows 10 (in PowerShell) use “Get-FileHash” (Also, note the use of uppercase)
C:\Users\Doug> Get-FileHash .\Linux-Hadoop-Minimal-V2.0-beta1.ova -Algorithm MD5
2. Either create your own user account as described above or use the existing “hands-on” user account. The examples will not work if run as the root account.
3. In previous versions there was a permission issue in HDFS that prevented Hive jobs from working. To fix it, perform the following steps:
a) Log in to the VM as root (pw=“hadoop”)
ssh root@127.0.0.1 -p 2222
b) then change to hdfs user
su - hdfs
c) fix the permission error:
hdfs dfs -chmod o+w /user/hive/warehouse
d) Check the result
hdfs dfs -ls /user/hive
e) The output of the previous command should look like:
Found 1 items drwxrwxrwx - hive hadoop 0 2019-01-24 20:43 /user/hive/warehouse
f) Exit out of the hdfs account
g) Exit out of the root account
You should now be back at the terminal on your laptop/desktop