Operationalizing a Hadoop Eco-System (Part 3: Installing and using Hive)



In part 1 of this series, I demonstrated how to install, configure, and run a three-node Hadoop cluster. In part 2, you were shown how to take the default "word count" YARN job that ships with Hadoop 2.2.0 and improve it. In this leg of the journey, I will demonstrate how to install and run Hive. Hive is a tool that sits atop Hadoop and lets you run YARN (next-generation MapReduce) jobs without writing Java code. With Hive and its query language, HiveQL, querying data across HDFS becomes simple. HiveQL is a SQL-like language, so anyone with SQL knowledge gets immediate access to data in HDFS. HiveQL also lets you reference custom MapReduce scripts directly in your queries.



As an example, the Java code needed to perform a simple word count with YARN runs to roughly 100 lines. In HiveQL, only three are needed:

hive> CREATE TABLE test (line STRING);
hive> LOAD DATA INPATH '/test/testing/hamlet.txt' OVERWRITE INTO TABLE test;
hive> SELECT word, COUNT(*) FROM test LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word;

By leveraging Hive and HiveQL, people get easy access to data in HDFS, queried with a SQL-like language. This opens the data up to a much broader spectrum of users and, in turn, produces more actionable information.


Assumptions

- This part of the series presumes that you have read and implemented part 1 of the series.
- If you did not work through part 1, it is presumed you are running Hadoop 2.2.0. Newer versions may work; however, they have not been tested.


Tidbits

- The installation of Hive occurs on the master node of the cluster. In this tutorial the master node is called hadoopm1.
- Be aware that some commands wrap due to blog templating. An example of this is the "wget" command in step 1 below.
- "#" indicates that the command is run as "root".
- "$" indicates that the command is run as "hduser".
- All Linux/Hive commands are highlighted in grayish blue.
- All output is highlighted in orange.

Index
1. Downloading & unpacking Apache HIVE
2. Installing HIVE & setting permissions
3. Configuring environment (.bash_profile)
4. Invoking HIVE command line (testing the install)
5. Loading data (hamlet.txt) into HDFS
6. Writing HiveQL to perform simple word count (several variations)
7. Final Thoughts
8. Giving Credit

Let’s Begin


1.  Log in as “root” on hadoopm1 (master node) and change to /opt directory.

# cd /opt

Download the latest "stable" version of Hive. At the time of this post, 0.13.1 was the latest release. If this release is no longer available, browse up the URL path to find the current one.

# wget http://apache.spinellicreations.com/hive/hive-0.13.1/apache-hive-0.13.1-bin.tar.gz

Unpack the file apache-hive-0.13.1-bin.tar.gz

# tar -xzf apache-hive-0.13.1-bin.tar.gz

2. Rename unpacked directory to something more friendly.

# mv  apache-hive-0.13.1-bin   hive

Change ownership and group of the “hive” directory.

# chown  -R  hduser:hadoop   /opt/hive

3. Modify environment (.bash_profile) for “hduser.”

# su - hduser
$ cd   /home/hduser

Open the .bash_profile for edit with your favorite editor and modify it so that it includes:

# User specific environment and startup programs

JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/jre
PATH=$PATH:$HOME/bin:/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.51.x86_64/jre/bin:/opt/hadoop/hadoop/bin:/opt/hive/bin
HADOOP_COMMON_LIB_NATIVE_DIR=/opt/hadoop/hadoop/lib/native
HADOOP_OPTS="-Djava.library.path=/opt/hadoop/hadoop/lib"
HIVE_HOME=/opt/hive
HADOOP_HOME=/opt/hadoop/hadoop
export PATH
export JAVA_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR
export HADOOP_OPTS
export HIVE_HOME
export HADOOP_HOME
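
After saving the file, reload it so the new variables take effect in the current session (or simply log out and back in). A quick sanity check, assuming the paths above match your install:

$ source ~/.bash_profile
$ echo $HIVE_HOME

/opt/hive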


4. You should now be able to invoke the HIVE CLI.

$ hive

hive>
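
If the prompt appears, the install is working. As a quick smoke test (nothing fancy, just confirming the CLI and metastore respond), you can list the built-in default database and its tables, then exit:

hive> show databases;
hive> show tables;
hive> quit;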


5. We now need to load some data into HDFS. If you recall in part 1 (step 18) of this series, we loaded a text file that contained the complete dialogue of Shakespeare’s Hamlet. I will now reproduce those steps here again.

Create directory on master node (hadoopm1) to act as the file system drop-off area.

$ mkdir -p  /opt/hadoop/inputFiles

Create a directory within hdfs for HIVE testing

$ cd  /opt/hadoop/hadoop/bin
$ ./hadoop  fs  -mkdir  -p   /test/hivetesting

I will now push a file into HDFS to test that the cluster is healthy. The file is the text of Shakespeare's Hamlet. For your convenience, I have put the play into a *.txt file and compressed it. Please download the file here: Shakespeare's Hamlet.

Once downloaded, unzip the file and move it (using your favorite sftp client) to hadoopm1:/opt/hadoop/inputFiles. Once it is on the local file system, push it into HDFS:

$ cd  /opt/hadoop/hadoop/bin
$ ./hadoop  fs  -copyFromLocal /opt/hadoop/inputFiles/Hamlet.txt   /test/hivetesting

You can ignore the “WARN util.NativeCodeLoader:” message
Check to see that the file made it into HDFS

$ ./hadoop  fs  -ls  /test/hivetesting

-rw-r--r--   2 hduser supergroup   188274 2014-02-22 22:52 /test/hivetesting/Hamlet.txt


6. Writing HiveQL
Now for some fun. I will write a simple HiveQL query that will create a distinct list of words and their counts from Shakespeare’s Hamlet.

$ cd /opt/hive/bin
$ ./hive

Create an external table over the HDFS directory that holds the file. Hive table locations point at directories rather than individual files, so the table will pick up Hamlet.txt automatically.

hive> create external table if not exists hivetest(line string) location '/test/hivetesting';

OK
Time taken: 0.968 seconds
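
Before counting anything, it is worth confirming the table actually sees the data. A couple of quick checks (output trimmed here):

hive> describe hivetest;
hive> select line from hivetest limit 5;

The second query should print the first few lines of the play.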

Simple word count

hive> select word, count(*) from hivetest lateral view explode(split(line,' ')) lTable as word group by word;

You should see (after a few seconds of processing) a whole lot of words with counts scroll by very fast.


youth, 4
youth: 1
zone, 1
Time taken: 50.0 seconds, Fetched: 8758 row(s)
hive>
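
Dumping every word is not terribly readable. A small, optional refinement of the same query sorts by frequency and truncates the output; "cnt" is simply an alias I introduce here:

hive> select word, count(*) as cnt from hivetest lateral view explode(split(line,' ')) lTable as word group by word order by cnt desc limit 20;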

To make this example a bit more useful, I will demonstrate how to use Hive in "non-interactive" mode. This requires that we first create a hivetest.hql script and then invoke it via the CLI. Note that the script creates its own managed table and uses LOAD DATA INPATH, which moves (rather than copies) the file within HDFS; if the external hivetest table from the previous step still exists, drop it first (hive> drop table hivetest;) so the script starts clean. Using your favorite editor, open a file "/opt/hive/scripts/hivetest.hql" for editing and add the following:

create table if not exists hivetest(line string);
load data inpath '/test/hivetesting/Hamlet.txt' overwrite into table hivetest;
select word, count(*) from hivetest lateral view explode(split(line,' ')) lTable
as word group by word;

Next, I invoke the "hive" CLI and pass the "hivetest.hql" file we just created as an argument (-v echoes each statement as it runs):

$ hive  -v  -f  hivetest.hql  >  hivetest.tsv

See hivetest1.hql for script.
This command will produce an output file “hivetest.tsv” that will contain the complete list of words and their counts.
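
Because the Hive CLI writes result rows to stdout as tab-separated values, ordinary shell tools work fine on the file. For example, a quick way to see the most common words (assuming the file landed in the current directory):

$ sort -k2 -nr hivetest.tsv | head -20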

A variation on the word count is to trim white space and remove punctuation. This is achieved with the following form of the query:

hive> select regexp_replace(lower(trim(word)), "[^A-Za-z0-9]",""), count(*) from hivetest lateral view explode(split(line, ' ')) lTable as word group by regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","");

See hivetest2.hql for script.
See hivetest2.tsv for the output of this query.
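
One side effect worth noting: any token made up of nothing but punctuation is reduced to an empty string by the regexp_replace, and that empty string shows up as its own row in the counts. If that bothers you, it can be filtered out with a small addition to the query:

hive> select regexp_replace(lower(trim(word)), "[^A-Za-z0-9]",""), count(*) from hivetest lateral view explode(split(line, ' ')) lTable as word where regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","") != '' group by regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","");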

Another variation involves the use of "stop words." Stop words are words you do not want included in the word count. A few examples of typical stop words: and, but, or, this, that, these, I, you, etc. Using stop words can be achieved with the following form of the query:

hive> select regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","") w, count(*)
from hivetest lateral view explode(split(line, ' ')) lTable as word
where (regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","")) not in ('a','and','the','you','your','they','them','i','is','am','are')
group by regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","");

See hivetest3.hql for script.
See hivetest3.tsv for the output of this query.
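
If the stop list grows beyond a handful of words, hard-coding it in the query gets unwieldy. One alternative, sketched here as a script snippet, is to keep the stop words in their own external table and anti-join against it. This assumes you have created a one-word-per-line stop-word file and copied it to an HDFS directory of your choosing; the path /test/stopwords below is hypothetical.

-- assumes a one-word-per-line stop-word file already sits in the (hypothetical) HDFS directory /test/stopwords
create external table if not exists stopwords(word string) location '/test/stopwords';
select t.w, count(*)
from (select regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","") w
      from hivetest lateral view explode(split(line, ' ')) lTable as word) t
left outer join stopwords s on (t.w = s.word)
where s.word is null
group by t.w;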


7. Final Thoughts
Upon completion of this tutorial, it should be evident that Hive and HiveQL are powerful tools. They give anyone with SQL experience the ability to leverage Hadoop and query massive volumes of data with the efficiency of YARN (MapReduce). The HiveQL language is quite robust and provides most of the SQL constructs you would need in your day-to-day database queries.


8. Giving Credit
I want to mention Ms. Neha Sharma who provided technical assistance with this post.
Also, I must give credit to two other websites that I researched to aid in this post:

Apache HIVE
Confluence Spaces

Regards, Louis
