Category Archives: Big Data

Operationalizing a Hadoop Eco-System (Part 3: Installing and Using Hive)


Hadoop Hive


In part 1 of this series, I demonstrated how to install, configure, and run a three-node Hadoop cluster. In part 2, you were shown how to take the default “word count” YARN job that ships with Hadoop 2.2.0 and make it better. In this leg of the journey, I will demonstrate how to install and run Hive. Hive is a tool that sits atop Hadoop and runs YARN (next-generation MapReduce) jobs without requiring you to write Java code. With Hive and its query language, HiveQL, querying data across HDFS is made simple. HiveQL is a SQL-like language, so anyone with SQL knowledge gets immediate access to data in HDFS. HiveQL also lets you reference custom MapReduce scripts right in your queries.
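To make this concrete, here is a minimal sketch of querying Hive from a Java client over JDBC. Everything specific in it is an assumption for illustration: the HiveServer2 host named master on the default port 10000, the hadoop user, and a docs table holding one line of text per row. The interesting part is the HiveQL string, which Hive compiles into MapReduce work for you.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWordCount {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (shipped with the Hive distribution).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumption: HiveServer2 listens on port 10000 of a node named "master".
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://master:10000/default", "hadoop", "");
        Statement stmt = con.createStatement();
        // A SQL-like word count over a hypothetical table docs(line STRING);
        // Hive turns this query into MapReduce jobs behind the scenes.
        ResultSet rs = stmt.executeQuery(
                "SELECT word, COUNT(*) AS freq "
              + "FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w "
              + "GROUP BY word ORDER BY freq DESC LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}

The ten most frequent words come back like any other JDBC result set, and no mapper or reducer was written by hand.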



Continue reading


Filed under Big Data, Infrastructure, Tutorials

Operationalizing a Hadoop Eco-System (Part 2: Customizing MapReduce)


Hadoop MapReduce

It gives me great pleasure to introduce a new contributor to DataTechBlog. Ms. Neha Sharma makes her debut with this blog post. Neha is a talented software engineer and big data enthusiast. In this post, she will demonstrate how to enhance the “word count” MapReduce job that ships with Hadoop. The enhancements include the removal of “stop” words, an option for case insensitivity, and the removal of punctuation.

In part 1 of this series, you were shown how to install and configure a Hadoop cluster. Here, you will be shown how to modify a MapReduce job; in this case, the job to be modified is the word count example that ships with Hadoop.
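To give a flavor of the enhancements, here is a minimal sketch of what such a modified mapper could look like. The class name and the tiny stop-word list are illustrative stand-ins, not the tutorial's actual code: lower-casing provides case insensitivity, a regular expression strips punctuation, and a set lookup drops stop words.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EnhancedWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    // Illustrative stop-word list; a real job might load this from a file.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "the", "and", "of", "to", "in"));
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Lower-case for case insensitivity, then replace punctuation with spaces.
        String line = value.toString().toLowerCase().replaceAll("[^a-z\\s]", " ");
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1) for the reducer to sum
            }
        }
    }
}

The stock word-count reducer works unchanged, since all of the filtering happens on the map side.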

Continue reading


Filed under Big Data

Modern Data Architecture: The Data Lake

Modern Data Architecture

Today, one of the hottest topics out there is “big data.” It seems that everybody is talking about it, and more and more companies are throwing their hats into the big data ring. These are exciting times because there is a fundamental shift in how we think about data. Not that long ago, structured data reigned supreme. For data architects, the methods of handling data (transactional and dimensional) were grounded in sound theory thanks to E.F. Codd (relational modeling), Bill Inmon (top-down 3NF design), Ralph Kimball (dimensional modeling), and Daniel Linstedt (Data Vault architecture). We now live in a post-relational world where the majority of the data being generated (estimates put it at 80%) is semi-structured, quasi-structured, or unstructured (1). Further, this data is growing at a rapid rate. As of 2012, digital content was being created at a rate of 2.5 quintillion (a 1 with 18 trailing zeros) bytes each day (2). Moreover, between 2009 and 2020 we can expect to see a 44-fold increase in all digital content, and of this data only 5% will be classified as structured (3). So, with all those impressive stats, the question staring us in the face is this: “How do we manage this deluge of unstructured data, and how do we get it to play nice with structured data?” Enter the Data Lake!


Continue reading


Filed under Big Data

Operationalizing a Hadoop Eco-System (Part 1: Installing & Configuring a 3-node Cluster)

Hadoop eco-system

The objective of DataTechBlog is to bring the many facets of data, data tools, and the theory of data to those curious about data science and big data. The relationship between these disciplines and data can be complex. However, with a carefully considered tutorial, a newcomer can be brought up to speed quickly. With that said, I am extremely excited to bring you this tutorial on the Hadoop eco-system. Hadoop and MapReduce are, at a high level, not complicated ideas. Basically, you take a large volume of data and spread it across many servers (HDFS). Once at rest, the data can be acted upon by the many CPUs in the cluster (MapReduce). What makes this so cool is that the traditional approach to processing data (bring the data to the CPU) is flipped: with MapReduce, the CPU is brought to the data. This “divide-and-conquer” approach makes Hadoop and MapReduce indispensable when processing massive volumes of data. In part 1 of this multi-part series, I am going to demonstrate how to install, configure, and run a 3-node Hadoop cluster. At the end, I will run a simple MapReduce job to perform a unique word count of Shakespeare’s Hamlet; a sketch of the driver for such a job appears below. Future installments of this series will include topics such as: 1. creating an advanced word count with MapReduce, 2. installing and running Hive, 3. installing and running Pig, and 4. using Sqoop to extract and import structured data into HDFS. The goal is to illuminate all the popular and useful tools that support Hadoop.
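For a sense of the moving parts, here is a minimal sketch of the driver that submits such a word-count job to the cluster. It reuses the TokenizerMapper and IntSumReducer classes that ship in the Hadoop examples jar, which is assumed to be on the classpath; the input and output HDFS paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input path, args[1] = HDFS output directory (must not exist yet)
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hamlet word count");
        job.setJarByClass(WordCountDriver.class);
        // Stock mapper/combiner/reducer from the Hadoop examples jar.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit to YARN and block until the job finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it would be launched with something like hadoop jar wordcount.jar WordCountDriver /user/hadoop/hamlet.txt /user/hadoop/hamlet-out, with the counts landing in HDFS under the output directory.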


Continue reading


Filed under Big Data, Infrastructure, Tutorials