Up until this point, all instructional posts were tutorials on setting up an infrastructure and readying an environment for data science projects. Over the next two tutorials, I am going to walk you through a complete data analysis project. You will be shown the proper steps necessary to ensure a consistent and repeatable process that can be used for all your data analysis projects. Simply put, this tutorial’s goal is to create a framework and provide a set of tools that can be used to support any data science project.
1. Steps to a Data Analysis
When you perform a data analysis, the steps you follow will almost always include the same sequence of events for each project. This results in a consistent and repeatable process that can be audited at each step of the way. Having the ability to audit and keep a record of each step of your data analysis is extremely important, especially when you are asked to justify the steps that led to a particular conclusion. The steps outlined here are a combination of: 1) my own experience, 2) ideas expressed in EMC’s Data Science and Big Data Analytics course, and 3) ideas from Jeff Leek’s “Data Analysis” course offered through Coursera.
a. Discovery – What is the question to be answered, and does an adequate data set exist such that a meaningful analysis can be performed?
b. Data acquisition – Here you will be gathering and staging the data in your environment. You should always try to start with data that has not been processed in any way. That is to say, you should always start with raw data. If you cannot get the raw data, you need to understand how the data was transformed and/or processed.
c. Data Cleansing – Raw data will almost always need to be cleansed. Tidy Data best practices should always be applied.
d. Data Munging – Alter names of data variables, split data variables into two or more, reshape data, seek outliers and remove if necessary, etc. During this phase, data is split into “training” and “test” data sets if applicable.
e. Data Exploration and Model Planning – Initial exploratory graphs, look for correlations in multivariate data, etc. This is the step where you begin to understand which types of models will be used to answer the questions asked of the data.
f. Data Analysis and Model Building – Training data is analyzed with model selections, results are recorded and patterns take shape. Final models are then applied (when applicable) to the “test” data.
g. Interpretation and scrutinization of results – During this time you ask yourself whether the results make sense, whether the correct models were applied, and whether the test data set was complete. Are these results valid? Always question the outcome of your analysis.
h. Analysis write-up – Write up a clear and methodical report of the sequence of events that culminated in the final analysis. It should be noted that you do not need to include every single step of your analysis. You need only include those steps that tell a complete, but compact story. It is critical that your write-up is written for the intended audience. If it is for non-technical people, ensure that the message is written in a language that is easily digestible by them. If the audience is technical, such as data scientists, then ensure the details of your modeling are described. Know your audience and write to them in a way that ensures your work is understood.
i. Operationalize – Produce your report, scripts, code, and supporting technical documentation. Run a pilot experiment and implement your models in a production environment.
2. Steps covered in this tutorial
A data analysis project can be quite lengthy as a function of a number of factors. I want to demonstrate, in a reasonable amount of detail, each step of the analysis and as such will break up this tutorial into 2 parts. For this part of the tutorial I am going to demonstrate and walk you through:
a.) Setting up a folder structure to support a data science project.
b.) Data Acquisition
c.) Data Cleansing
d.) Data Munging
3. Data analysis description
In this tutorial I will be performing an analysis of Pearson’s height data, which is a bivariate list of the heights of fathers and their sons. This data was the result of a famous experiment by Karl Pearson in the early 1900s. The goal here is to determine a simple linear regression, using the method of basic least squares, that will answer the question: “Is there a correlation between a father’s height and his son’s height? If so, can a model be built to represent the relationship between the two variables?”
4. Creating a folder structure to support a data analysis project.
When you perform a data analysis you tend to accumulate lots of scripts, plots, figures and data (raw and munged). You should maintain some order around these files, and an easy way to do that is through a simple folder structure on your workstation. The figure below is an example of what a folder structure may look like:
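As a sketch of what such a layout might look like, the following R snippet creates one possible structure. The folder names here are my own convention, not a requirement of any tool:

```r
# Create a simple project skeleton; the names below are illustrative only
projectRoot <- file.path("~", "projects", "pearson")
subFolders  <- c("data/raw", "data/munged", "scripts", "figures", "docs")
for (sub in subFolders) {
  dir.create(file.path(projectRoot, sub), recursive = TRUE, showWarnings = FALSE)
}
```

Keeping raw and munged data in separate folders makes it easy to prove, later, exactly what the original inputs were.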
5. The question to be answered
Can I predict the height of a son (fully grown) given his father’s height? Using the Pearson data set, can I create a model that will predict the height of a son given the height of the father as input?
6. Downloading Pearson’s height data using R
So now that I created a folder structure to support this project, I can begin by first creating an R script that will be used to contain code for acquiring the Pearson data set. I personally like to make this a stand-alone file that has the sole purpose of gathering data and then persisting it to disk. The basic steps for this task are:
a. Set the working directory using the R “setwd()” command (Linux environment).
* Note: if you are working on a Windows machine your path would look similar to:
setwd("c:\\users\\fred\\projects\\pearson\\data\\raw"). Note that each backslash must be escaped.
b. Create the R “fileUrl” variable holding the URL of the data.
> fileUrl <- "http://datatechblog.com/wp-content/uploads/2013/11/pearsonData.csv?accessType=DOWNLOAD"
It should be noted that this data set has been altered from the original data set to include data anomalies. This was done to help illustrate the data cleansing portion of this tutorial. The original “raw” data can be downloaded here.
c. Download data using the “download.file” command.
This will put the data into a file named “pearsonDataRaw.csv” into the directory defined by the “setwd” command.
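Assuming the “fileUrl” variable from step (b), the call would look something like this:

```r
# Fetch the raw data and save it as pearsonDataRaw.csv in the working directory
download.file(fileUrl, destfile = "./pearsonDataRaw.csv")
```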
d. Create a variable to indicate the date and time the data was downloaded.
> dateDownloaded <- date()
"Mon Dec 2 21:06:53 2013"
e. Read the data into a data.frame.
> pearsonDataRaw <- read.csv("./pearsonDataRaw.csv", as.is = TRUE)
You can see these steps in full R detail by downloading initialDataLoad_Pearson
7. Data preparation and Tidy data
We now have our data in a construct called a data.frame, which, in short, is a collection of equal-length vectors (its columns). It is in this form that we will explore the data.
a. Using the dim() command determine the number of rows and columns:
 1078 2
The output tells us there are 1078 rows and 2 columns of data.
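The call that produced that output is simply:

```r
# dim() reports the dimensions of the data frame: rows first, then columns
dim(pearsonDataRaw)
```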
This is a good time to point out the online help functionality of RStudio. While in the console window, if you type a question mark followed by the name of a function (for example, ?summary), the help window will display documentation for that function.
b. Determine the variable names with the “names()” command:
 “father” “son”
c. We know from the description of the Pearson data that the two variables contain height data for the father and the son. The titles of these variables, “father” and “son”, don’t tell us anything about what type of data is present. I am going to give these two variables new names that are more descriptive: “father” will become “fatherHeight”, and “son” will become “sonHeight”:
> names(pearsonDataRaw) <- c("fatherHeight", "sonHeight")
Check that the names have been changed:
 “fatherHeight” “sonHeight”
8. Data exploration
a. Next let’s look and see what types of data are stored in the two variables:
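The original command is not shown here, so this is my assumption of one way to check (str() would work just as well):

```r
# Report the class of each column in the data frame; both should be numeric
sapply(pearsonDataRaw, class)
```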
We now know that the height data for both the father and the son are numeric type.
b. Given that our data is of numeric type we can use the summary() command to generate sample quantiles of the data in the data frame:
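The call itself is just:

```r
# Minimum, quartiles, median, mean, and maximum for each numeric column
summary(pearsonDataRaw)
```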
Right away I see a problem in the data. For the variable “fatherHeight” I see a minimum height of 27.81 inches, and for “sonHeight” I see a minimum height of 0 inches. This is a good time to introduce some exploratory plots, which help with identifying data issues and with describing how the data is distributed. A quick side note about exploratory plots: they are made quickly and they don’t have to be perfect. Things such as axis labels don’t have to be formal; you can use the data.frame variable names, and colors can be used to help show patterns (multivariate plots). These plots are also plentiful: you will be creating lots of them to help understand the data.
c. Using the plot() command I create an exploratory graph of the data.frame pearsonDataRaw;
> plot(pearsonDataRaw$sonHeight ~ pearsonDataRaw$fatherHeight, col = "blue", pch = 19)
Figure (1) shows a nice cluster of points in the upper right quadrant. It also shows what appear to be 3 outliers. Two of them indicate a “sonHeight” of less than 10 inches, and one shows a “fatherHeight” of less than 30 inches. This is congruent with the output of the summary() command demonstrated earlier.
d. Using the hist() command, I create a histogram of the variable “fatherHeight”. This helps me understand how the data is distributed:
axis(side=1, at=seq(10,80,1), xlim=c(10, 80))
I also add a red line to indicate the mean of the distribution;
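Piecing these fragments together, the full sequence is roughly as follows (the breaks setting is my guess at the original; only the axis() call appears verbatim above):

```r
# Histogram of fatherHeight with a custom x-axis and a red line at the mean
hist(pearsonDataRaw$fatherHeight, breaks = 50, xaxt = "n",
     xlab = "fatherHeight", main = "Histogram of fatherHeight")
axis(side = 1, at = seq(10, 80, 1), xlim = c(10, 80))
abline(v = mean(pearsonDataRaw$fatherHeight), col = "red", lwd = 2)
```

Substituting pearsonDataRaw$sonHeight throughout gives the histogram in step (e).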
It can be seen that the outlier is forcing the histogram to be squished. This narrowing of the histogram is causing some of the resolution to be lost.
e.) Let’s create the same plot for the variable “sonHeight”
>axis(side=1, at=seq(0,80,1), xlim=c(10, 80))
Overlay the histogram with a red “mean” line:
It is evident that the three bad data points are having an effect on the shape and distribution of the data. It is safe to assume that removal of these data points will not cause issues with the analysis; it can only help. When performing an exploratory analysis such as this, these types of data points, once identified, need to be removed from the data set. However, care and special attention must be given to data points that are statistical outliers and not just outright errors like we have here. Unlike in this tutorial, the difference between an outright error and a statistical outlier may not always be clear. This is where a solid understanding of the domain will prove to be invaluable.
9. Data munging
Now that we have identified three data points that are errors, we need to remove them from the data to ensure that the outcome of the analysis is not skewed by bad data. To remove bad data in a data.frame you perform a subsetting of the data. I am going to create a new data.frame called pearsonDataMunged and it will consist of the data from pearsonDataRaw with the omission of the three bad records:
>pearsonDataMunged <- subset(pearsonDataRaw, pearsonDataRaw$fatherHeight>30 & pearsonDataRaw$sonHeight>30)
The command above says: Using the pearsonDataRaw data.frame select all records that have a “fatherHeight” greater than 30 inches “AND” a “sonHeight” greater than 30 inches.
So now if we redraw Figures (1-3) with the “pearsonDataMunged” data.frame we will see how the data shape and distribution has changed. You can run the same R commands listed above with the one substitution of “pearsonDataMunged” in place of “pearsonDataRaw.”
First the Scatter Plot
> plot(pearsonDataMunged$sonHeight ~ pearsonDataMunged$fatherHeight, col = "blue", pch = 19)
With the noise gone, the scatter plot looks much better. It has what appears to be nice symmetry and a reasonably close distribution of points.
Next the histogram of “fatherHeight”
>axis(side=1, at=seq(10,80,1), xlim=c(10, 80))
The histogram also looks much better with the bad data gone. I see a nice symmetric distribution centered around the mean.
Finally, the histogram of “sonHeight”
>axis(side=1, at=seq(10,80,1), xlim=c(10, 80))
The same outcome for the “sonHeight” histogram.
Finally, like I did in step 8.b, I want to once again use the summary() command on the munged data:
We can now see a few changes to the output of the summary() command:
a.) Min heights for both “fatherHeight” and “sonHeight” are now in line with expectations of the data
b.) Slight changes to 1st Quartile, Median, Mean, and 3rd Quartile.
As the final step, I will write my munged data.frame to disk:
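The file name and options here are my assumption, since the original command is not shown; something like:

```r
# Persist the cleaned data for the modeling work in Part 2
write.csv(pearsonDataMunged, file = "./pearsonDataMunged.csv", row.names = FALSE)
```

Writing the munged data to its own file, rather than overwriting pearsonDataRaw.csv, preserves the audit trail discussed in section 1.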
You can see these steps (6-9) in full R detail by downloading initialDataScrub_Pearson.
Congratulations! You just completed several crucial steps of a data analysis. These steps lay the foundation for the next part (Part 2 of 2), which includes: 1) performing a Basic Least Squares analysis and 2) writing up the analysis.