Operationalizing a Hadoop Eco-System (Part 1: Installing & Configuring a 3-node Cluster)

The objective of DataTechBlog is to bring the many facets of data, data tools, and the theory of data to those curious about data science and big data.  The relationship between these disciplines and data can be complex.  However, with a carefully constructed tutorial, the newcomer can be brought up to speed quickly.  With that said, I am extremely excited to bring you this tutorial on the Hadoop eco-system.  Hadoop and MapReduce (at a high level) are not complicated ideas.  Basically, you take a large volume of data and spread it across many servers (HDFS).  Once at rest, the data can be acted upon by the many CPUs in the cluster (MapReduce).  What makes this so cool is that the traditional approach to processing data (bring the data to the CPU) is flipped: with MapReduce, the CPU is brought to the data.  This "divide-and-conquer" approach makes Hadoop and MapReduce indispensable when processing massive volumes of data.  In part 1 of this multi-part series, I am going to demonstrate how to install, configure and run a 3-node Hadoop cluster.  At the end, I will run a simple MapReduce job to perform a unique word count of Shakespeare's Hamlet.  Future installments of this series will cover topics such as: 1. Creating an advanced word count with MapReduce, 2. Installing and running Hive, 3. Installing and running Pig, 4. Using Sqoop to extract and import structured data into HDFS.  The goal is to illuminate all the popular and useful tools that support Hadoop.


Assumptions
– This blog post assumes that you are comfortable working in a Linux environment.
– This blog post requires that you have three CentOS (or Red Hat) VMs created and pre-configured.  If you want instruction on how to create VMs, please see my blog post on this very topic: Creating a Virtualized Environment.
The VMs should be configured identically, as follows:

Memory – 4G
CPU Cores – 1
Swap – 2G
tmpfs – 2G
root file system “/” – 18G (ext4)
CentOS 6.5 (or 6.4) x64
VMs must have internet access


Tidbits

The three VMs I created are listed here, and they will be referenced by these names and IPs throughout this post. Your VMs will most likely have different IPs, and the server names might differ as well.  However, I do suggest keeping the names the same to avoid confusion.

Master Node => hadoopm1: 192.168.116.101
Slave Node 1 => hadoops1: 192.168.116.102
Slave Node 2 => hadoops2: 192.168.116.103

– This project does require that your workstation be reasonably robust. I did this work on a Windows 7 x64 machine with 1 TB of storage and 24G of RAM.
– It should be noted that this installation of Hadoop is of the basic variety.  No consideration is given to security or performance within the cluster; those topics are advanced and fall outside the purview of this post.
– Be aware that some commands wrap due to blog templating. An example of this is step 6 below.
– “#” indicates that the command is run as “root”
– “$” indicates that the command is run as “hduser”
– All Linux commands are highlighted in grayish blue
– All output is highlighted in orange

Index
1. Installing Java
2. Installing SSH
3. Create user account to own Hadoop install
4. Modify /etc/hosts
5. Configure key based logins
6. Download and install Hadoop 2.2.0
7. Modify the .bash_profile for O.S. user “hduser”
8. Configure Hadoop: core-site.xml
9. Configure Hadoop: hdfs-site.xml
10. Configure Hadoop: mapred-site.xml
11. Configure Hadoop: yarn-env.sh
12. Configure Hadoop: yarn-site.xml
13. Configure Hadoop: slaves file
14. Configure Hadoop: mapred-env.sh
15. Configure Hadoop: hadoop-env.sh
16. Copy Hadoop install to slave nodes
17. Initializing Hadoop
18. Testing HDFS
19. Running MapReduce
20. Hadoop UIs
21. Giving Credit
22. Next Steps

Let’s Begin


1.  Log in as "root" on hadoopm1 and check which version of Java is installed.

# java  -version
-bash: java: command not found

When I created my VMs I did not install Java. If you already have Java installed, just make sure it is at version 1.7.0.
Using YUM, install OpenJDK 1.7:

# yum  install  java-1.7.0-openjdk.x86_64

If all goes well, after a long list of output you will be prompted with:

“Complete!”

Run the Java command to verify that Java installed successfully:

# java  -version
java version “1.7.0_51”
OpenJDK Runtime Environment (rhel-2.4.4.1.el6_5-x86_64 u51-b02)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

Perform all of Step 1 again on each slave node: hadoops1, hadoops2.


2.  As “root” on hadoopm1, test to check that SSH is installed.

# ssh  -V
OpenSSH_5.3p1, OpenSSL 1.0.0-fips 29 Mar 2010

SSH is almost always installed by default, and you should see output similar to mine above. If you do not, then you need to install the SSH server and client:

# yum install openssh-server.x86_64
# yum install openssh-clients.x86_64

Check that the SSH daemon is running:

# /etc/init.d/sshd status
openssh-daemon (pid 6227) is running…

If your output does not say running, then you need to start it.

# /etc/init.d/sshd start

Perform all of Step 2 again on each slave node: hadoops1, hadoops2.


3.  As "root" on hadoopm1, create a dedicated group and user account to own the Hadoop installation.

# groupadd hadoop
# useradd -g hadoop hduser
# passwd hduser

Security is not a concern here, so I chose a password that is easy to remember.

Perform all of Step 3  again on each slave node: hadoops1, hadoops2.


4.  As “root” on hadoopm1, modify /etc/hosts
– Make a backup of /etc/hosts before modifying it

# cp  /etc/hosts  /etc/hosts.orig

– Using your favorite editor open /etc/hosts and add the following lines

127.0.0.1 localhost
192.168.116.101 hadoopm1 hadoopm1
192.168.116.102 hadoops1 hadoops1
192.168.116.103 hadoops2 hadoops2

– The first line of the /etc/hosts file will contain an entry for “localhost.” Ensure that this entry does not reference the server name. This entry should look like:

127.0.0.1 localhost

– This is important for the proper use of the web UIs that come with Hadoop.

Perform all of Step 4 again on each slave node: hadoops1, hadoops2.


5.  As “hduser” on hadoopm1, configure key based logins

# su – hduser

Do not provide a passphrase when prompted for one in the steps below.  Just press the ENTER key.

$ ssh-keygen -t  rsa
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory ‘/home/hduser/.ssh’.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:

Copy the public key to each node (including the master itself):

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoopm1
Are you sure you want to continue connecting (yes/no)? yes

You will be prompted for the "hduser" password.

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoops1
Are you sure you want to continue connecting (yes/no)? yes

You will be prompted for the "hduser" password.

$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoops2
Are you sure you want to continue connecting (yes/no)? yes

You will be prompted for the "hduser" password.

Perform all of Step 5 again on each slave node: hadoops1, hadoops2.

Each node should now be accessible from every node. To test, connect (as hduser) to hadoopm1 from hadoopm1:

$ ssh hadoopm1

You should be brought to a “$” on the hadoopm1 node and you should not have been prompted for a password.
Exit from that prompt and repeat the test with hadoops1 and hadoops2.

Run the same tests from each of the slave nodes. In all you should have run 9 tests.


6. Download and install Hadoop 2.2.0
Log onto hadoopm1 as root and run the following commands:

# mkdir -p  /opt/hadoop
# cd  /opt/hadoop
# wget  http://apache.mesi.com.ar/hadoop/common/stable2/hadoop-2.2.0.tar.gz
# tar -xzf  hadoop-2.2.0.tar.gz
# mv  hadoop-2.2.0  hadoop
# chown -R  hduser:hadoop  /opt/hadoop

Don’t worry about the other nodes just yet. This step is for hadoopm1 only.


7. Modify .bash_profile for O.S. user “hduser”
Switch to “hduser” on hadoopm1

# su – hduser
Using your favorite editor, open the .bash_profile file for "hduser" and modify it to include the following:

JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/jre
PATH=$PATH:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/jre/bin
HADOOP_COMMON_LIB_NATIVE_DIR=/opt/hadoop/hadoop/lib/native
HADOOP_OPTS="-Djava.library.path=/opt/hadoop/hadoop/lib"
export JAVA_HOME
export PATH
export HADOOP_COMMON_LIB_NATIVE_DIR
export HADOOP_OPTS

The minor version numbers of your Java install may differ; adjust these paths accordingly.
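
To confirm the new variables take effect (a quick sanity check of my own, not part of the original post), re-source the profile and echo one of them:

$ source ~/.bash_profile
$ echo $JAVA_HOME
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/jre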

Perform all of Step 7 again on each slave node: hadoops1, hadoops2.


8. Configure Hadoop: core-site.xml
To get the cluster up and running, there are four XML files (plus a handful of environment scripts) that need to be configured. I start with core-site.xml.


The following commands are run as the user ‘hduser’

$ cd  /opt/hadoop/hadoop/etc/hadoop
As usual, we make a backup copy of any file we are about to edit:

$ cp  core-site.xml  core-site.xml.orig

Open core-site.xml for edit with your editor of choice. You will see an empty “configuration” tag pair. Edit this tag pair to include the following:

[The original post shows a screenshot of the completed core-site.xml configuration here.]
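
Since the screenshot does not reproduce here, the following is a minimal sketch of what a core-site.xml for this layout typically contains: it points the default file system at the NameNode on hadoopm1. The property name and port (9000) are common choices and are my assumptions, not necessarily the exact values from the original screenshot.

<configuration>
  <property>
    <!-- URI of the NameNode; fs.defaultFS is the Hadoop 2.x name
         (fs.default.name also still works but is deprecated) -->
    <name>fs.defaultFS</name>
    <value>hdfs://hadoopm1:9000</value>
  </property>
</configuration>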


9. Configure Hadoop: hdfs-site.xml
Make a backup of the file.

$ cd /opt/hadoop/hadoop/etc/hadoop
$ cp hdfs-site.xml hdfs-site.xml.orig

Open hdfs-site.xml for edit with your editor of choice. You will see an empty “configuration” tag pair. Edit this tag pair to include the following:

[The original post shows a screenshot of the completed hdfs-site.xml configuration here.]
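
As a stand-in for the screenshot, a minimal hdfs-site.xml for this two-DataNode cluster would at least set the replication factor (the HDFS listing in step 18 shows a replication factor of 2). The original configuration may also have set dfs.namenode.name.dir and dfs.datanode.data.dir; those paths are not known, so treat this block as an illustrative sketch only.

<configuration>
  <property>
    <!-- keep two copies of each block, one per DataNode -->
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>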

10. Configure Hadoop: mapred-site.xml
This step has many parts, so pay close attention here.
First we need to create four new directories on each of the three nodes.
Because we have key-based logins, we can do this quite easily from one node. Note that /opt/hadoop must already exist and be writable by "hduser" on the slaves; if it does not (it was created as root in step 6 on the master only), create it as root on each slave and chown it to hduser:hadoop first.
Starting on hadoopm1:

$ mkdir -p /opt/hadoop/mapred/temp
$ mkdir -p /opt/hadoop/mapred/local
$ mkdir -p /opt/hadoop/nodemanager
$ mkdir -p /opt/hadoop/yarn

Jump onto slave node 1 and run the commands

$ ssh hadoops1
$ mkdir -p /opt/hadoop/mapred/temp
$ mkdir -p /opt/hadoop/mapred/local
$ mkdir -p /opt/hadoop/nodemanager
$ mkdir -p /opt/hadoop/yarn
$ exit

Jump onto slave node 2 and run our commands

$ ssh hadoops2
$ mkdir -p /opt/hadoop/mapred/temp
$ mkdir -p /opt/hadoop/mapred/local
$ mkdir -p /opt/hadoop/nodemanager
$ mkdir -p /opt/hadoop/yarn
$ exit

The file mapred-site.xml is actually created from the template mapred-site.xml.template

$ cd /opt/hadoop/hadoop/etc/hadoop
$ cp  mapred-site.xml.template  mapred-site.xml

Open mapred-site.xml for edit with your editor of choice. You will see an empty “configuration” tag pair. Edit this tag pair to include the following:

[The original post shows a screenshot of the completed mapred-site.xml configuration here.]
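
In place of the screenshot, here is a sketch of a typical Hadoop 2.2 mapred-site.xml for this setup: the essential property runs MapReduce on YARN, and the directory properties map onto the directories created above. The exact property set in the original screenshot may differ, so treat these values as assumptions.

<configuration>
  <property>
    <!-- run MapReduce jobs on YARN rather than the classic framework -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <!-- scratch space created earlier on every node -->
    <name>mapreduce.cluster.temp.dir</name>
    <value>/opt/hadoop/mapred/temp</value>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/opt/hadoop/mapred/local</value>
  </property>
</configuration>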


11. Configure Hadoop: yarn-env.sh
Make a backup of the file

$ cd /opt/hadoop/hadoop/etc/hadoop

$ cp  yarn-env.sh  yarn-env.sh.orig

Open yarn-env.sh for edit with your editor of choice and modify it to include:

export YARN_CONF_DIR=/opt/hadoop/hadoop/etc/hadoop
export YARN_LOG_DIR=/opt/hadoop/yarn


12. Configure Hadoop: yarn-site.xml
Make a backup of the file

$ cd /opt/hadoop/hadoop/etc/hadoop
$ cp yarn-site.xml yarn-site.xml.orig

Open yarn-site.xml for edit with your editor of choice. You will see an empty “configuration” tag pair. Edit this tag pair to include the following:

[The original post shows a screenshot of the completed yarn-site.xml configuration here.]
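
Standing in for the screenshot, a typical yarn-site.xml for this layout enables the MapReduce shuffle service on the NodeManagers, points them at the ResourceManager on hadoopm1, and uses the NodeManager directory created in step 10. These property choices are my assumptions about what the original configuration contained, not a copy of it.

<configuration>
  <property>
    <!-- auxiliary service required for MapReduce shuffle in Hadoop 2.2 -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <!-- the ResourceManager runs on the master node -->
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoopm1</value>
  </property>
  <property>
    <!-- local working space for the NodeManager, created in step 10 -->
    <name>yarn.nodemanager.local-dirs</name>
    <value>/opt/hadoop/nodemanager</value>
  </property>
</configuration>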


13. Configure Hadoop: slaves file
Make a backup of the file

$ cd /opt/hadoop/hadoop/etc/hadoop
$ cp slaves slaves.orig

Open slaves for edit with your editor of choice. Add to it the slave node server names (because hadoopm1 is not listed, the master will not run a DataNode or NodeManager):

hadoops1
hadoops2

14. Configure Hadoop: mapred-env.sh
Make a backup of the file

$ cd /opt/hadoop/hadoop/etc/hadoop
$ cp  mapred-env.sh  mapred-env.sh.orig

Open mapred-env.sh for edit with your editor of choice. Add to it the following:

export  JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/jre

15. Configure Hadoop: hadoop-env.sh
Make a backup of the file

$ cd /opt/hadoop/hadoop/etc/hadoop
$ cp hadoop-env.sh hadoop-env.sh.orig

Open hadoop-env.sh for edit with your editor of choice. Add to it the following:

export  JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/jre
export  HADOOP_CONF_DIR=/opt/hadoop/hadoop/etc/hadoop
(the following line may already be present; check before adding it)
export  HADOOP_OPTS=-Djava.net.preferIPv4Stack=true


16. Copy Hadoop install to slave nodes
Working from hadoopm1

$ cd  /opt/hadoop
$ scp  -r  hadoop hadoops1:/opt/hadoop
$ scp  -r  hadoop hadoops2:/opt/hadoop

What I just did was to copy the entire (configured) Hadoop install directory to each of the slave nodes.
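
As a quick spot check (my own suggestion, not in the original post), confirm the copy landed on each slave:

$ ssh hadoops1 "ls /opt/hadoop/hadoop | head -5"
$ ssh hadoops2 "ls /opt/hadoop/hadoop | head -5"

You should see the top of the Hadoop directory listing (bin, etc, include, lib, libexec) from each node.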

17. Initializing Hadoop
Working from hadoopm1

$ cd  /opt/hadoop/hadoop/bin
$ ./hadoop  namenode  -format

You may see a warning that this command is deprecated (hdfs namenode -format is the Hadoop 2.x equivalent); that is fine.

Start the cluster:

$ cd  /opt/hadoop/hadoop/sbin
$ ./start-all.sh
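
A quick way to confirm the daemons came up (again my own suggestion, not part of the original post) is the jps command. Note that jps ships with the JDK development tools, so you may need to install java-1.7.0-openjdk-devel to get it:

$ jps

On hadoopm1 you should see NameNode, SecondaryNameNode and ResourceManager; on hadoops1 and hadoops2 you should see DataNode and NodeManager (plus the Jps process itself on each node).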

18. Testing HDFS
Create a directory on the master node (hadoopm1) to act as the file-system drop-off area:

$ mkdir -p /opt/hadoop/inputFiles

Create a directory within hdfs for testing

$ cd /opt/hadoop/hadoop/bin
$ ./hadoop  fs  -mkdir  -p  /test/testing

I will now push a file into HDFS to test that the cluster is healthy.
The file I will push into HDFS is the text of Shakespeare’s Hamlet.   For your convenience I put the play into a *.txt file and compressed it.  Please download the  file here:  Shakespeare’s Hamlet.

Once downloaded, unzip and move the file (using your favorite sftp client) to hadoopm1:/opt/hadoop/inputFiles.  Once it is on the file system, push it into HDFS:

$ cd /opt/hadoop/hadoop/bin
$ ./hadoop fs -copyFromLocal /opt/hadoop/inputFiles/Hamlet.txt /test/testing

You can ignore the "WARN util.NativeCodeLoader" message.
Check to see that the file made it into HDFS:

$ ./hadoop fs -ls /test/testing

-rw-r--r-- 2 hduser supergroup 181255 2014-01-12 22:52 /test/testing/Hamlet.txt
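
For a broader health check than a single listing (another suggestion of mine rather than part of the original post), the dfsadmin report shows the state of every DataNode:

$ cd /opt/hadoop/hadoop/bin
$ ./hdfs dfsadmin -report

The report should show two live DataNodes (hadoops1 and hadoops2) along with their configured and remaining capacity.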

19. Running MapReduce
Now for some fun. I will run a MapReduce job that creates a distinct list of words and their counts from Shakespeare's Hamlet.

Working from hadoopm1, in /opt/hadoop/hadoop/bin

$ ./yarn jar /opt/hadoop/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /test/testing/Hamlet.txt /test/testing/mapreduceresult

A bunch of output will fly by. A successful completion will end with something similar to:

14/01/29 23:38:07 INFO mapreduce.Job: map 0% reduce 0%
14/01/29 23:38:16 INFO mapreduce.Job: map 100% reduce 0%
14/01/29 23:38:25 INFO mapreduce.Job: map 100% reduce 100%
14/01/29 23:38:25 INFO mapreduce.Job: Job job_1391065395488_0001 completed successfully

The result of the MapReduce job is sitting in HDFS.  I need to copy it out to the file system

$ cd /opt/hadoop/hadoop/bin
$ ./hadoop fs -copyToLocal /test/testing/mapreduceresult /opt/hadoop/inputFiles/HamletProcessed

In /opt/hadoop/inputFiles you will see a new folder named: HamletProcessed.
In this folder there will be a file with a name similar to “part-r-00000”.
Using your favorite sftp client, pull this file back to your workstation and add a *.txt extension to it.
This file is a tab delimited list containing two variables: Word and Count.

It should be noted that this tab-delimited file contains every unique word and its count. The sample MapReduce job that comes with Hadoop is case sensitive, which means "HAMLET" and "Hamlet" are treated as different words. Also, no stop word list was applied, so words like "this", "that", "a", and "an" are counted too.
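
If you just want a quick look at the most frequent words without pulling the file to your workstation, you can cat the HDFS output and sort it on the count column (a shortcut of mine; it assumes the single reducer output file is named part-r-00000, as above):

$ cd /opt/hadoop/hadoop/bin
$ ./hadoop fs -cat /test/testing/mapreduceresult/part-r-00000 | sort -k2 -nr | head -20

This prints the 20 most frequent tokens and their counts.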

20. Hadoop UIs

When Hadoop is running, it exposes several web UIs that show the general health of the cluster:

1. NameNode (hadoopm1) UI: http://192.168.116.101:50070/dfshealth.jsp
2. ResourceManager (hadoopm1) UI, the YARN replacement for the Job Tracker: http://192.168.116.101:8088/cluster/cluster
3. NodeManager (hadoops1) UI: http://192.168.116.102:8042/node
4. NodeManager (hadoops2) UI: http://192.168.116.103:8042/node

21. Giving credit

I want to give a special thanks to Ms. Neha Sharma who researched, recommended and configured all the *.xml files that are necessary to make Hadoop run.  Ms. Sharma is a brilliant and resourceful Java engineer who, like yours truly, shares an affinity for data science and big data.  She is making her blogging debut here on DataTechBlog.  Neha will be demonstrating how the default MapReduce job for word count in Hadoop can be modified to:

1. Be case insensitive
2. Include stop word filtering (removing words like and, but, or, this, that, etc.)

Also, I must give credit to several other websites and blogs that I researched to aid me in the installation of Hadoop:

Hadoop Apache Org
Michael G. Noll Blog
TECADMIN.NET
Shakespeare’s Hamlet: Originating Text

22. Next Steps

This tutorial is the first basic step in the journey to manage and manipulate big data using Hadoop.  Now that you have a basic cluster configured, you are able to perform basic MapReduce jobs.  In the next few weeks, you can expect to see tutorials on using Hive, Pig and Sqoop.  I expect many blog posts to follow, all centered on the Hadoop eco-system.

In part 2 of this series Ms. Sharma blogs on how to improve the basic MapReduce functionality that comes default with Hadoop.

Regards, Louis
