In part 1 of this series, I demonstrated how to install, configure, and run a three-node Hadoop cluster. In part 2, you were shown how to take the default “word count” YARN job that comes with Hadoop 2.2.0 and make it better. In this leg of the journey, I will demonstrate how to install and run Hive. Hive is a tool that sits atop Hadoop and lets you run YARN (next-generation MapReduce) jobs without having to write Java code. With Hive and its query language, HiveQL, querying data across HDFS becomes simple. HiveQL is a SQL-like language that gives anyone with SQL knowledge immediate access to data in HDFS. HiveQL also lets you reference custom MapReduce scripts directly in your queries.
As an example, the Java code needed to perform a simple word count with YARN runs to roughly 100 lines. In HiveQL, only three lines are needed:
hive> CREATE TABLE test (line STRING);
hive> LOAD DATA INPATH '/test/testing/hamlet.txt' OVERWRITE INTO TABLE test;
hive> SELECT word, COUNT(*) FROM test LATERAL VIEW explode(split(line, ' ')) lTable as word
GROUP BY word;
By leveraging Hive and HiveQL, people are afforded easy access to data in HDFS through a “SQL-like” language. This empowers a much broader spectrum of users and promotes more actionable information.
1. Log in as “root” on hadoopm1 (master node) and change to /opt directory.
# cd /opt
Download the latest stable version of Hive. At the time of this post, 0.13.1 was the latest release. If that release no longer exists, browse to the parent URL to find the current one.
# wget http://apache.spinellicreations.com/hive/hive-0.13.1/apache-hive-0.13.1-bin.tar.gz
Unpack the file apache-hive-0.13.1-bin.tar.gz
# tar -xzf apache-hive-0.13.1-bin.tar.gz
2. Rename the unpacked directory to something friendlier.
# mv apache-hive-0.13.1-bin hive
Change ownership and group of the “hive” directory.
# chown -R hduser:hadoop /opt/hive
3. Modify environment (.bash_profile) for “hduser.”
# su - hduser
$ cd /home/hduser
Open .bash_profile with your favorite editor and modify it so that it includes:
# User specific environment and startup programs
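The comment above marks where the additions go; the exact contents depend on your part 1 setup, but at a minimum, assuming Hive was unpacked to /opt/hive as above, add:
export HIVE_HOME=/opt/hive
export PATH=$PATH:$HIVE_HOME/bin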
4. You should now be able to invoke the Hive CLI.
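As a quick sanity check (assuming the environment changes above have been sourced, for example by logging out and back in as hduser):
$ hive
hive> quit;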
5. We now need to load some data into HDFS. If you recall, in part 1 (step 18) of this series we loaded a text file containing the complete dialogue of Shakespeare’s Hamlet. I will reproduce those steps here.
Create directory on master node (hadoopm1) to act as the file system drop-off area.
$ mkdir -p /opt/hadoop/inputFiles
Create a directory within HDFS for Hive testing.
$ cd /opt/hadoop/hadoop/bin
$ ./hadoop fs -mkdir -p /test/hivetesting
I will now push a file into HDFS to confirm that the cluster is healthy. The file is the text of Shakespeare’s Hamlet; for your convenience, I have put the play into a *.txt file and compressed it. Please download the file here: Shakespeare’s Hamlet.
Once downloaded, unzip the file and move it (using your favorite SFTP client) to /opt/hadoop/inputFiles on hadoopm1. Once it is on the local file system, push it into HDFS:
$ cd /opt/hadoop/hadoop/bin
$ ./hadoop fs -copyFromLocal /opt/hadoop/inputFiles/Hamlet.txt /test/hivetesting
You can ignore the “WARN util.NativeCodeLoader:” message
Check to see that the file made it into HDFS
$ ./hadoop fs -ls /test/hivetesting
-rw-r--r--   2 hduser supergroup     188274 2014-02-22 22:52 /test/hivetesting/Hamlet.txt
6. Writing HiveQL
Now for some fun. I will write a simple HiveQL query that will create a distinct list of words and their counts from Shakespeare’s Hamlet.
$ cd /opt/hive/bin
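Launch the Hive CLI (assuming the Hadoop environment from part 1 is still in place for hduser):
$ ./hive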
Create an external table over the HDFS directory containing Hamlet.txt; each row will hold one line of the play.
hive> create external table if not exists hivetest(line string) location '/test/hivetesting';
Time taken: 0.968 seconds
Simple word count
hive> select word, count(*) from hivetest lateral view explode(split(line,' ')) lTable as word group by word;
You should see (after a few seconds of processing) a whole lot of words with counts scroll by very fast.
Time taken: 50.0 seconds, Fetched: 8758 row(s)
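The full list scrolls by too quickly to read. As a variation of my own (not part of the original example), the same query can be ordered and limited to show only the most frequent words:
hive> select word, count(*) as cnt from hivetest lateral view explode(split(line, ' ')) lTable as word group by word order by cnt desc limit 20;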
To make this example a bit more useful, I will demonstrate how to use Hive in “non-interactive” mode. This requires that we first create an .hql script and then invoke it via the CLI. Using your favorite editor, open the file /opt/hive/scripts/hivetest.hql and add the following:
create table if not exists hivetest(line string);
load data inpath '/test/hivetesting/Hamlet.txt' overwrite into table hivetest;
select word, count(*) from hivetest lateral view explode(split(line,' ')) lTable
as word group by word;
Next, I invoke the hive CLI and pass the hivetest.hql file we just created as an argument:
$ hive -v --hivevar threshold=10 -f hivetest.hql > hivetest.tsv
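Because the Hive CLI writes query results as tab-separated text, you can quickly sanity-check the redirected output from the shell; for example (assuming a bash shell):
$ sort -t$'\t' -k2,2 -rn hivetest.tsv | head -20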
A variation on the word count is to trim white space and remove punctuation. This is achieved with the following form of the query:
hive> select regexp_replace(lower(trim(word)), "[^A-Za-z0-9]",""), count(*) from hivetest lateral view explode(split(line, ' ')) lTable as word group by regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","");
Another variation involves the use of “stop words.” Stop words are words you don’t want counted in the word count. A few examples of typical stop words include: and, but, or, this, that, these, I, you, etc. Using stop words can be achieved with the following form of the query:
hive> select regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","") w, count(*)
from hivetest lateral view explode(split(line, ' ')) lTable as word
where (regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","")) not in ('a','and','the','you','your','they','them','i','is','am','are')
group by regexp_replace(lower(trim(word)), "[^A-Za-z0-9]","");
7. Final Thoughts
Upon completion of this tutorial, it should be evident that Hive and HiveQL are powerful tools. They enable anyone with SQL experience to leverage the power of Hadoop and query massive volumes of data with the efficiency of YARN (MapReduce). The HiveQL language is quite robust and offers most of the SQL constructs you would need in your day-to-day database queries.
8. Giving Credit
I want to thank Ms. Neha Sharma, who provided technical assistance with this post.
Also, I must give credit to two other websites that I consulted while researching this post: