Operationalizing a Hadoop Eco-System (Part 2: Customizing Map Reduce)



It gives me great pleasure to introduce a new contributor to DataTechBlog. Ms. Neha Sharma makes her debut with this blog post. Neha is a talented software engineer and big data enthusiast. In this post, she demonstrates how to enhance the “word count” MapReduce job that ships with Hadoop. The enhancements include the removal of “stop” words, an option for case insensitivity, and the removal of punctuation.

Part 1 of this series showed how to install and configure a Hadoop cluster. Here you will learn how to modify a MapReduce job: in this case, the word count example that ships with Hadoop.
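The three enhancements amount to a token-cleaning step inside the mapper. The actual job is written in Java; purely as a hedged illustration, here is a minimal Python sketch of that cleaning logic (the stop-word list and function name are my own assumptions, not code from the post):

```python
import re

# Illustrative stop-word list; a real job would load a much fuller set.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def clean_tokens(line, case_insensitive=True):
    """Tokenize a line the way an enhanced word-count mapper might:
    optionally lowercase, strip punctuation, and drop stop words."""
    if case_insensitive:
        line = line.lower()
    # Replace punctuation with spaces, keeping word characters intact.
    line = re.sub(r"[^\w\s]", " ", line)
    return [tok for tok in line.split() if tok not in STOP_WORDS]
```

Each surviving token would then be emitted by the mapper as a (token, 1) pair; for example, `clean_tokens("To be, or not to be!")` yields `["be", "not", "be"]`, with the punctuation gone and "to" and "or" filtered out.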

Continue reading

1 Comment

Filed under Big Data

Modern Data Architecture: The Data Lake

Today, one of the hottest topics out there is “big data.”  It seems that everybody is talking about it, and more and more companies are throwing their hats into the big data ring.  These are exciting times because there is a fundamental shift in how we think about data.  Not that long ago, structured data reigned supreme.  For data architects, the methods of handling data (transactional and dimensional) were grounded in sound theory thanks to E.F. Codd (relational modeling), Bill Inmon (top-down 3NF design), Ralph Kimball (dimensional modeling) and Daniel Linstedt (Data Vault architecture).  We now live in a post-relational world where the majority of the data being generated (estimates put it at 80%) is semi-structured, quasi-structured or unstructured (1).  Further, this data is growing at a rapid rate.  As of 2012, digital content was being created at a rate of 2.5 quintillion (1 with 18 trailing zeros) bytes each day (2)!  Moreover, between 2009 and 2020 we can expect a 44-fold increase in all digital content, of which only 5% will be classified as structured (3).  So, with all those impressive stats, the question staring us in the face is this: “How do we manage this deluge of unstructured data, and how do we get it to play nice with structured data?”  Enter the Data Lake!


Continue reading

1 Comment

Filed under Big Data

Operationalizing a Hadoop Eco-System (Part 1: Installing & Configuring a 3-node Cluster)

The objective of DataTechBlog is to bring the many facets of data, data tools, and the theory of data to those curious about data science and big data.  The relationship between these disciplines and data can be complex.  However, with a carefully considered tutorial, it is practical to expect that a layman can be brought up to speed quickly.  With that said, I am extremely excited to bring you this tutorial on the Hadoop eco-system.  Hadoop and MapReduce are not, at a high level, complicated ideas.  Basically, you take a large volume of data and spread it across many servers (HDFS).  Once at rest, the data can be acted upon by the many CPUs in the cluster (MapReduce).  What makes this so cool is that the traditional approach to processing data (bring the data to the CPU) is flipped: with MapReduce, the CPU is brought to the data.  This “divide-and-conquer” approach makes Hadoop and MapReduce indispensable when processing massive volumes of data.  In part 1 of this multi-part series, I demonstrate how to install, configure and run a 3-node Hadoop cluster.  At the end, I run a simple MapReduce job to perform a unique word count of Shakespeare’s Hamlet.  Future installments of this series will cover topics such as: 1. creating an advanced word count with MapReduce, 2. installing and running Hive, 3. installing and running Pig, 4. using Sqoop to extract and import structured data into HDFS.  The goal is to illuminate all the popular and useful tools that support Hadoop.
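To make the map/shuffle/reduce flow described above concrete, a word count can be simulated in miniature. This is a single-process Python sketch of the three phases, not actual Hadoop code; in a real cluster the map and reduce steps run in parallel across nodes and the framework performs the shuffle:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: each line is split into (word, 1) pairs, as mappers do in parallel.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key; the framework does this between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the grouped counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["to be or not to be"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

The cluster version of this is exactly what the Hamlet word count at the end of the tutorial does, only spread across three nodes.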


Continue reading

Leave a Comment

Filed under Big Data, Infrastructure, Tutorials

Training for the Aspiring Data Scientist: My Experience with a MOOC.

In the inaugural post of DataTechBlog, I stated a goal of helping others learn more about the emerging field of data science and big data analytics.  My original intent was to tackle this lofty goal primarily via instructive tutorials.  Lately, however, I have come to realize just how important formal instruction is to the learning process.  Prior to my professional career in data, I spent the better part of 12 years in college earning various degrees and taking many (extra) classes.  The education process afforded me an opportunity to immerse myself in various topics.  In the milieu of the classroom, opportunities abounded, and I, a motivated student, was able to run with the proverbial ball.  My point is that, having been out of college for a while, I had forgotten just how important classroom learning is for fostering expertise in a particular subject or discipline.  As such, I signed up for and successfully completed my first MOOC (Massive Open Online Course).


Continue reading

1 Comment

Filed under Education & Instruction

DataTechBlog’s New Look and Feel

I am pleased to present the new look and feel of DataTechBlog. I felt the blog needed to speak a bit more to my personality and taste, and I am thrilled with the work done by a great UX designer. James Brown is outstanding at his craft: a tireless perfectionist with an eye for utility, system values, and user-centered design. He worked with me and my whimsy for weeks as I put him through iteration after iteration of the DataTechBlog branding. The result, as can be seen above in the masthead, is an expression of the abstraction of pure data.  Another change was the adoption of Google Web Fonts; this, along with a multitude of small changes throughout the site, helps drive home the vision of DataTechBlog.

James can be reached here.

Louis V. Frolio

Leave a Comment

Filed under Home, Main

Greenplum, R, RStudio, and Data. The Basic Ingredients for Successful Recipes.

In the last three tutorials (Tutorial 1, Tutorial 2, Tutorial 3), I demonstrated how to create an infrastructure to support data science projects.  Next in the evolution is to show how you can load data into Greenplum and R for analysis. For this tutorial I am using the famous Fisher Iris data set.  This data is most often used to demonstrate how discriminant analysis can reveal the similarities and dissimilarities of objects, in this case three species of Iris.  I chose this particular data set because we will be using it in a tutorial in the near future.
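The tutorial itself loads the data into Greenplum (via SQL) and into R. Purely as an illustrative sketch of the same load-and-summarize idea, here is a Python version using a handful of made-up iris-style rows rather than the full data set; the column names and values below are my own stand-ins:

```python
import csv
import io
from collections import defaultdict

# A few iris-style rows inline for illustration (not the real data set).
IRIS_CSV = """sepal_length,species
5.1,setosa
4.9,setosa
7.0,versicolor
6.4,versicolor
6.3,virginica
"""

def mean_by_species(csv_text):
    """Group rows by species and average sepal_length -- the kind of
    per-group summary a discriminant analysis builds on."""
    sums = defaultdict(lambda: [0.0, 0])  # species -> [running total, count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        acc = sums[row["species"]]
        acc[0] += float(row["sepal_length"])
        acc[1] += 1
    return {species: total / n for species, (total, n) in sums.items()}
```

In the upcoming tutorial the equivalent grouping and averaging happens in-database with SQL and in R, where the per-species differences the Fisher data is famous for become obvious.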

Continue reading

Leave a Comment

Filed under Foundations, Tutorials

Beyond the Basics – Data Science for Business (Foster Provost & Tom Fawcett).

In an earlier post, I recommended a short read on introductory data science and big data. That book gave a fantastic overview of the major areas and ideas governing these disciplines. However, if you are motivated to dig deeper and wrap your head around the details, then Data Science for Business should be your next read. This book does a fantastic job of helping the reader understand how one should think when considering data science as a profession, or when trying to understand all the hype. Further, much detail and time is given to the “Data Analytics Lifecycle,” which governs data science projects through process and a framework.  The authors meticulously step through the various modeling techniques with solid examples and explanations.  Some sections of the book detail the math and its derivations, which may prove a challenge if your math is rusty; however, that should not present much of an obstacle to understanding the gist of what is being conveyed.  In the preface, the authors state that the book is intended for business people who work with data scientists, manage data scientists, or seek to understand the value in data science, as well as for developers implementing data science solutions and, finally, aspiring data scientists.  I believe this book has a role to play in one’s data science education, and that it is an appropriate read for those wishing to understand, in detail, how data science is done and what it aims to achieve.


Louis V. Frolio

2 Comments

Filed under Suggested Readings

Building an Infrastructure to Support Data Science Projects (Part 3 of 3) – Installing and Configuring R / RStudio with Pivotal Greenplum Integration

In this third and final part (Part 1 of 3, Part 2 of 3) of the series, I walk you through the installation and configuration of R and RStudio.  I also demonstrate how R is integrated with Pivotal Greenplum.  For those of you who don’t know what R is, you can go here for a lot of useful information.  In short, R is a scripting language and runtime environment used for performing complex (or simple) statistical analysis of data. It is available for free under the GNU General Public License.  RStudio is a free and open-source IDE for R; you can go here for more information about it.

Continue reading

17 Comments

Filed under Infrastructure, Tutorials

Building an Infrastructure to Support Data Science Projects (Part 2 of 3) – Installing Greenplum with MADlib

In the first part of this series (Part 1 of 3) we installed and configured CentOS on a virtual machine.  This laid the foundation and made ready an environment that will now be used to install Pivotal Greenplum Community Edition, which allows any use on a single node under Pivotal’s license model.  As part of this tutorial I will also demonstrate how to install the open-source MADlib libraries into Greenplum.  MADlib provides a rich set of libraries for advanced in-database data analysis and mining, which can be called via regular SQL. The installation of Greenplum and MADlib will facilitate some of the data science exercises I will be demonstrating in the near future.

Continue reading

2 Comments

Filed under Infrastructure, Tutorials

Building an Infrastructure to Support Data Science Projects (Part 1 of 3) – Creating a Virtualized Environment.


As with any project or experiment, infrastructure has to be in place to support the intended work.  In the case of a data science project, the obvious first step is the computing environment.  Simply stated, you can’t do advanced analytics on large data sets without CPU, RAM and disk. With these items as your foundation, much can be designed, engineered and built.  Before we can walk through a data science project, we first need hardware and software in place.  For the purposes of the tutorials here on DataTechBlog, a PC or laptop with adequate CPU, RAM and disk will suffice.  Further, my plan is to use only open or free software and code for all tutorials; you need only a reasonably spec’d computer to accomplish everything we will do here.  This tutorial walks you through the installation of VMware Player and CentOS 6.x (optimized for Pivotal Greenplum).  It lays the foundation for the next steps, which include the installation of Pivotal Greenplum, the MADlib libraries, R, and RStudio.  When this environment is complete, you will be able to perform many types of “in-database” analysis using SQL with MADlib, analysis using R with Greenplum, and analysis with R against flat files or manually entered data.

Continue reading

Leave a Comment

Filed under Infrastructure, Tutorials