Category Archives: Tutorials

Technical How-Tos

Structuring a Data Analysis using R (Part 2 of 2) – Analyzing, Modeling, and the Write-up.


In the first part of this two-part series (Structuring a Data Analysis using R (Part 1 of 2)), I discussed several key aspects essential to any successful data analysis project. In that post I also began a prototypical analysis project, working my way through munging the data. Those steps set the stage for the analysis and modeling phases of the project. This post picks up where Part 1 left off and carries the analysis through to a formal write-up.
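To give a flavor of the modeling step covered in the full post, here is a minimal R sketch. The simulated data frame below is a hypothetical stand-in for the munged data set produced in Part 1, not the actual data used in the series.

```r
# Hypothetical stand-in for the munged data set produced in Part 1
set.seed(42)
sales <- data.frame(ad_spend = runif(100, min = 1000, max = 5000))
sales$revenue <- 2.5 * sales$ad_spend + rnorm(100, sd = 1500)

# Fit a simple linear model and review coefficients, R-squared, and p-values
fit <- lm(revenue ~ ad_spend, data = sales)
summary(fit)

# Standard diagnostic plots to sanity-check the model's assumptions
par(mfrow = c(2, 2))
plot(fit)
```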

Continue reading

2 Comments

Filed under Foundations, Tutorials

Structuring a Data Analysis using R (Part 1 of 2) – Gathering, Organizing, Exploring and Munging Data.

Over the next two tutorials, I am going to walk you through a complete data analysis project. You will see the steps necessary to ensure a consistent, repeatable process that you can apply to all of your data analysis projects. Simply put, the goal of this tutorial is to establish a framework and a set of tools that can support any data science project.
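As a preview of the explore-and-munge steps the full post walks through, here is a minimal R sketch; R's built-in airquality data set stands in for data you would gather and organize yourself.

```r
# R's built-in airquality data set stands in for data gathered elsewhere
df <- airquality

# Explore: structure, summary statistics, and missing values
str(df)
summary(df)
colSums(is.na(df))                  # missing values per column

# Munge: drop incomplete rows and tidy the column names
clean <- na.omit(df)
names(clean) <- tolower(names(clean))
head(clean)
```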


Continue reading

6 Comments

Filed under Foundations, Tutorials

Data Warehouse ETL Offload with Hadoop.


Data volumes are growing at an exponential rate, straining traditional IT infrastructures. As a result, more and more organizations are taking advantage of emerging technologies like Hadoop to relieve that pressure. Hadoop and its eco-system of tools play an important role in tackling the tough problems plaguing traditional IT and data warehouse environments.

Specifically, Extract, Transform, and Load (ETL) work can be offloaded to Hadoop when growing data volumes begin to break traditional IT and ETL processes. Using screencasts and animated video, I will demonstrate how Hadoop can take on the most taxing ETL workloads.

First, I will illustrate the overarching problem of ETL overload at a fictitious company called Acme Sales. From there, actual demonstrations (screencasts) walk through an ETL offload using Hadoop, Hive, and Sqoop.

Before we begin, I would like to introduce Noelle Dattilo, the newest guest author on DataTechBlog. Noelle is an education expert who specializes in animation technology, and she gets full credit for all the animations you are about to see in this post. I asked her to explain her process for creating compelling animations.

“This series of animations was created with an online program called VideoScribe, a whiteboard animation tool. The animations leverage problem-based learning to paint a conceptual picture in a form that is easily digestible to a wide audience. To create them, I found some interesting graphics, turned them into SVG files (that’s the tricky part), uploaded them into VideoScribe, and placed them in the order they should be drawn. Once I uploaded the audio track that Louis recorded, I synced the timing of each drawing to the track, and voilà, we have our Acme Sales animations.”



Continue reading

2 Comments

Filed under Big Data Use Cases, Tutorials

Operationalizing a Hadoop Eco-System (Part 3: Installing and using Hive)




In part 1 of this series, I demonstrated how to install, configure, and run a three-node Hadoop cluster. In part 2, you were shown how to take the default “word count” YARN job that ships with Hadoop 2.2.0 and improve it. In this leg of the journey, I will demonstrate how to install and run Hive. Hive is a tool that sits atop Hadoop and generates YARN (next-generation MapReduce) jobs without requiring you to write Java code. With Hive and its query language, HiveQL, querying data across HDFS becomes simple. HiveQL is a SQL-like language that gives anyone with SQL knowledge immediate access to data in HDFS, and it even lets you reference custom MapReduce scripts directly in your queries.
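For a sense of what this looks like from R, here is a hedged sketch of running a HiveQL query over JDBC with the RJDBC package; the jar path, host name, and sales table are hypothetical placeholders, and the full post covers the actual installation and setup.

```r
# Sketch only: the jar path, host name, and table are hypothetical placeholders
library(RJDBC)

drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
            classPath = "/opt/hive/lib/hive-jdbc-standalone.jar")
con <- dbConnect(drv, "jdbc:hive2://namenode:10000/default")

# HiveQL reads like ordinary SQL, but Hive turns it into YARN jobs over HDFS
dbGetQuery(con, "
  SELECT region, SUM(amount) AS total_sales
  FROM   sales
  GROUP  BY region
  ORDER  BY total_sales DESC")

dbDisconnect(con)
```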



Continue reading

1 Comment

Filed under Big Data, Infrastructure, Tutorials

Operationalizing a Hadoop Eco-System (Part 1: Installing & Configuring a 3-node Cluster)

The objective of DataTechBlog is to bring the many facets of data, data tools, and data theory to those curious about data science and big data. The relationship between these disciplines and data can be complex, but with a carefully constructed tutorial, a newcomer can be brought up to speed quickly. With that said, I am extremely excited to bring you this tutorial on the Hadoop eco-system. At a high level, Hadoop and MapReduce are not complicated ideas: you take a large volume of data and spread it across many servers (HDFS), and once at rest, that data can be acted upon by the many CPUs in the cluster (MapReduce). What makes this so compelling is that the traditional approach to processing data (bring the data to the CPU) is flipped; with MapReduce, the CPU is brought to the data. This “divide-and-conquer” approach makes Hadoop and MapReduce indispensable when processing massive volumes of data. In part 1 of this multi-part series, I will demonstrate how to install, configure, and run a 3-node Hadoop cluster, and at the end I will run a simple MapReduce job to perform a unique word count of Shakespeare’s Hamlet. Future installments of this series will cover creating a more advanced word count with MapReduce, installing and running Hive, installing and running Pig, and using Sqoop to extract and import structured data into HDFS. The goal is to illuminate the popular and useful tools that support Hadoop.
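To make the word-count job concrete before we dive into the cluster setup, here is what the same computation looks like in plain R on a single machine; Hadoop's contribution is spreading this work across the nodes of the cluster. The local file name is a hypothetical placeholder for the Hamlet text loaded into HDFS.

```r
# The same computation the MapReduce job performs, on a single machine in R.
# 'hamlet.txt' is a hypothetical local copy of the text loaded into HDFS.
text  <- tolower(readLines("hamlet.txt", warn = FALSE))
words <- unlist(strsplit(text, "[^a-z']+"))
words <- words[words != ""]

counts <- sort(table(words), decreasing = TRUE)
head(counts, 10)        # the ten most frequent words
length(counts)          # number of unique words
```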


Continue reading

Leave a Comment

Filed under Big Data, Infrastructure, Tutorials

Greenplum, R, RStudio, and Data. The Basic Ingredients for Successful Recipes.

In the last three tutorials (Tutorial 1, Tutorial 2, Tutorial 3), I demonstrated how to create an infrastructure to support data science projects. The next step in the evolution is to show you how to load data into Greenplum and R for analysis. For this tutorial I am using the famous Fisher Iris data set. This data is most often used to demonstrate how discriminant analysis can surface the similarities and dissimilarities among objects, in this case three species of iris. I chose this particular data set because we will be using it in an upcoming tutorial.
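As a quick preview, the Fisher Iris data ships with R, so you can explore it even before loading anything into Greenplum; a minimal sketch:

```r
# The Fisher Iris data set ships with R as `iris`
data(iris)
str(iris)
summary(iris)

# Per-species means hint at the separability that discriminant analysis exploits
aggregate(. ~ Species, data = iris, FUN = mean)

# A quick look at two of the measurements, colored by species
plot(iris$Petal.Length, iris$Petal.Width,
     col = iris$Species, pch = 19,
     xlab = "Petal length", ylab = "Petal width")
```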

Continue reading

Leave a Comment

Filed under Foundations, Tutorials

Building an Infrastructure to Support Data Science Projects (Part 3 of 3) – Installing and Configuring R / RStudio with Pivotal Greenplum Integration

In this third and final part of the series (Part 1 of 3, Part 2 of 3), I walk you through the installation and configuration of R and RStudio, and I demonstrate how R integrates with Pivotal Greenplum. For those of you who don’t know what R is, you can go here for a wealth of useful information. In short, R is a scripting language and runtime environment for performing complex (or simple) statistical analysis of data, available for free under the GNU General Public License. RStudio is a free and open-source IDE for R; you can go here for more information about it.
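One common way to reach Greenplum from R is through its PostgreSQL interface; the sketch below uses the RPostgreSQL package, with the host, database, table, and credentials as hypothetical placeholders. The post itself walks through the actual integration.

```r
# Sketch only: host, database, and credentials are hypothetical placeholders
library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 host     = "gpdb-host",
                 port     = 5432,
                 dbname   = "analytics",
                 user     = "gpadmin",
                 password = "changeme")

# Pull the result of a query from Greenplum into an R data frame
df <- dbGetQuery(con, "SELECT * FROM public.iris LIMIT 100")
head(df)

dbDisconnect(con)
```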

Continue reading

17 Comments

Filed under Infrastructure, Tutorials

Building an Infrastructure to Support Data Science Projects (Part 2 of 3) – Installing Greenplum with MADlib

In the first part of this series (Part 1 of 3), we installed and configured CentOS on a virtual machine. That laid the foundation for the environment we will now use to install Pivotal Greenplum Community Edition, which Pivotal’s license model allows for any use on a single node. As part of this tutorial I will also demonstrate how to install the open-source MADlib libraries into Greenplum. MADlib provides a rich set of routines for advanced in-database data analysis and mining, all callable via regular SQL. Installing Greenplum and MADlib will support some of the data science exercises I will be demonstrating in the near future.
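To illustrate what "calling MADlib via regular SQL" means in practice, here is a hedged sketch that trains a MADlib linear regression from R over the PostgreSQL interface; the connection details, table, and column names are hypothetical placeholders.

```r
# Sketch only: connection details, table, and column names are hypothetical
library(RPostgreSQL)

con <- dbConnect(dbDriver("PostgreSQL"),
                 host = "gpdb-host", dbname = "analytics",
                 user = "gpadmin", password = "changeme")

# Train a MADlib linear regression entirely inside the database
dbGetQuery(con, "
  SELECT madlib.linregr_train(
           'public.houses',              -- source table
           'public.houses_linreg',       -- output table for the model
           'price',                      -- dependent variable
           'ARRAY[1, tax, bath, size]')  -- independent variables
")

# Read the fitted coefficients back without pulling the raw data into R
dbGetQuery(con, "SELECT coef, r2 FROM public.houses_linreg")

dbDisconnect(con)
```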

Continue reading

2 Comments

Filed under Infrastructure, Tutorials

Building an Infrastructure to Support Data Science Projects (Part 1 of 3) – Creating a Virtualized Environment.


As with any project or experiment, infrastructure has to be in place to support the intended work. In the case of a data science project, the obvious first step is the computing environment: simply stated, you can’t do advanced analytics on large data sets without CPU, RAM, and disk. With those items as your foundation, much can be designed, engineered, and built. Before we can walk through a data science project, we first need hardware and software in place. For the purposes of the tutorials here on DataTechBlog, a PC or laptop with adequate CPU, RAM, and disk will suffice, and I plan to use only open or free software and code throughout. This tutorial walks you through the installation of VMware Player and CentOS 6.x (optimized for Pivotal Greenplum). That lays the foundation for the next steps, which include installing Pivotal Greenplum, the MADlib libraries, R, and RStudio. When the environment is complete, you will be able to perform many types of in-database analysis using SQL with MADlib, analysis using R with Greenplum, and analysis with R against flat files or manually entered data.

Continue reading

Leave a Comment

Filed under Infrastructure, Tutorials