It gives me great pleasure to introduce a new contributor to DataTechBlog. Ms. Neha Sharma makes her debut with this blog post. Neha is a talented software engineer and big data enthusiast. In this post, she will be demonstrating how to enhance the “word count” map reduce job that ships with hadoop. The enhancements will include the removal of “stop” words, the option for case insensitivity and the removal of punctuation.
In part 1 of this series you were shown how to install and configure a hadoop cluster. Here you will be shown how to modify a map reduce job. In this case the job to be modified is the word count example that ships with hadoop.
– All Linux O.S. commands preceded by “#” implies “run as root.”
– Be conscious that some commands wrap due to blog templating. An example of this is with the “Second Method” below. The “Sample Command” wraps twice.
In the hadoop distribution source code for the word count job (hadoop-mapreduce-examples-2.2.0-sources.jar) can be found at:
You will see WordCount.java file inside hadoop-mapreduce-examples-2.2.0-sources.jar. WordCount.java is the source that we will be changing.
The reason we are changing this code is simple. When performing word count operations you want to filter some common words. These “stop words” generally include words such as articles, pronouns, conjunctions, etc. Also, punctuation should be removed as well, unless there is a direct reason to keep them. In our example we are modifying the map reduce function to perform the following:
- Exclude stop words
- Convert words to lowercase. This is required because the map reduce job is case sensitive. E.g. “System” <> “system”
- Remove all punctuation characters (.,”:;?). This is done to avoid treating “system” and “system.” as two different words
You can customize the map reduce function how ever you like by changing WordCount.java. There can be many possible solutions or implementation methods to achieve the above stated objectives. The intent here is just to show an example. If you are familiar with Java, you may go with a more efficient implementation. I will demonstrate two implementation methods here leading to the same results.
In this case we will be making use of an Arraylist to store all common words we want to exclude from our result set. Below is a snippet of modified mapper piece of original WordCount class.
There are few important things to notice here:
First, we are creating a new Arraylist “stopwords” (line 27) which we will use to store all unwanted stop/common words. Second, we are making use of “setup” method inside the mapper class (line 32). This method is called once at the beginning of the task whereas map method is called once for each key/value pair. We are populating the new arraylist inside the setup method. Third is the modified map method (line 49) where we are first removing all punctuation characters and converting value to lower case (line 55) and then we are comparing this value with contents of our “stop words” list to exclude words from result (line 57).
One of the problems with this approach is that you have to hard code “stop word” inside code. In order to add any other word to this list, you would need to modify your code and compile it again.
A better solution is to use an external file with a list of “stop words” that gets passed as an argument to the program. You can make use of hadoop’s distributed cache approach to achieve this.
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications. Applications specify the files, via urls (hdfs:// or http://) to be cached via the JobConf. The DistributedCache assumes that the files specified via urls are already present on the file system at the path specified by the url and are accessible by every machine in the cluster.
You can find more details about distributed cache here.
Below attached is snippet of the modified code.
You now notice that our application needs three arguments. The first two are same as before i.e. input file and output location. Third argument is path/uri for external file containing “stop words” (line 124). If all required arguments are not provided , you would see usage error message while running job (line 125).
Another important thing to notice here is use of distributed cache (line 138). We are adding the file by providing a uri for the external file. This provides users with flexibility to keep external file wherever they want and they can modify the file or move the file or use a different file without having to change the code.
Now we would make use of this cached file in our mapper class:
In this case we are adding stop words to our arraylist “stopwords” using cached file inside setup method while the map method remains the same as in the first case. Inside our setup method, we are first retrieving our file from cache (line 53). Second we are reading file using BufferedReader (line 59). Finally we are writing the content of our file into the “stopwords” Arraylist (line 65).
To run modified map-reduce job use following commands:
Navigate to $HADOOP_HOME/
$ cd $HADOOP_HOME/bin
You can download jar file for this code from here.
Execute following commands as hadoop user:
./yarn jar <location of jar file> <path to input file> <output location/Dir>
To test our job we are using same Shakespeare’s Hamlet (hamlet.txt) file that we used in part 1 of this series. If you have not already downloaded this file, you can download it from here:
$ ./yarn jar /home/hduser/WordCountModified.jar /test/testing/hamlet.txt /test/testing/hamletresult1
To view results:
$ ./hadoop fs -cat /test/testing/hamletresult1/part-r-00000
You can download jar file for this code from here.
./yarn jar <location of jar file> <path to input file> <output location/Dir> <path/URI for file containing stop words>
$ ./yarn jar /home/hduser/WordCountFiltered.jar /test/testing/hamlet.txt /test/testing/hamletresult2 hdfs://aehadoopm1:9000/test/testing/StopWords.txt
To view results:
$ ./hadoop fs -cat /test/testing/hamletresult2/part-r-00000
Before running job please make sure of the following:
- Input files must be present in HDFS
- Output directory should not pre-exist (map-reduce will throw an error if output path already exist)
Having completed this exercise you should now have a level of comfort with manipulating map reduce jobs. As a function of your java skill set you should also have a better understanding of the map reduce framework. This will enable you to develop map reduce jobs to achieve other results.
In part 3 of the series I walk you through the installation and use of Apache HIVE & HiveQL.