In the first part (Structuring a Data Analysis using R (Part 1 of 2)) of this two part series, I discussed several key aspects necessary to any successful data analysis project. In that post I also began a prototypical data analysis project working my way up through munging of the data. All those steps in (Part 1 of 2) enabled me to begin the analysis and modeling parts of the project. This post picks up and continues with the data analysis which will culminate in a formal write-up of the data analysis demonstrated here.

**Assumptions**

**-**This last part of the series presumes that you have read and implemented part 1 of the series. If not, then see that tutorial here.

**Tidbits**

**Index**

**1. Data Exploration & Model Planning**

a. Load the prepared data created from Part 1 of the series:

> setwd(‘/home/datasci1/projects/pearson/data/munged’);

> pearsonDataMunged <- read.csv(“./pearsonDataMunged.csv”,as.is=TRUE);

b. Get re-acquainted with the shape of the data

> dim(pearsonDataMunged);

[1] 1075 3

Our data frame has 3 columns and 1075 rows.

> names(pearsonDataMunged);

[1] “X” “fatherHeight” “sonHeight”

The column names are “X”, “fatherHeight”, “sonHeight”. The column “X” is actually a meta-data column within the data frame that indicates the row number. This artifact occurred in Part 1, step 9-b. When I wrote out the data frame pearsonDataMunged to the file system, R appended a new column indicating the row. This extra column will not cause any issues. However, I will remove it to demonstrate how one removes a column from a data frame.

> pearsonDataMunged <- pearsonDataMunged[,2-3];

What this command is doing: Create a new data frame (same name as old) using all the rows in the originating data frame and using only columns 2 and 3.

I’m using the sapply() command report on the data types contained in each of the two columns

> sapply(pearsonDataMunged,class);

fatherHeight sonHeight

“numeric” “numeric”

As expected, each column in the remaining data frame is of type “numeric”.

c. Based on the data exploration in part 1 of this series, we determined that the data had a nice symmetric distribution centered around the average height for both the father and son. Further, we also saw that the mean height was also very close to the median height for both the father and the son. This strongly suggests a Normal (or Gaussian) distribution of data.

Create a plot to see how the data takes shape:

> plot(pearsonDataMunged$sonHeight ~ pearsonDataMunged$fatherHeight,

xlab=”Father’s height (in)

Figure (1)”,

ylab=”Son’s height (in)”,

xlim=c(58,78),

ylim=c(58,80),

bty=”l”,

pch=20,

col=”blue”,

main=”Pearson Data”);

d. Next I want to try and fit a Linear Regression using the method of Basic Least Squares. After the regression line has been fit we can test to see if the data in fact fits a normal distribution.

> plot(pearsonDataMunged$sonHeight ~ pearsonDataMunged$fatherHeight,

xlab=”Father’s height (in)

Figure (1)”,

ylab=”Son’s height (in)”,

xlim=c(58,78),

ylim=c(58,80),

bty=”l”,

pch=20,

col=”blue”,

main=”Pearson Data with Fitted Linear Regression Line”);

# Fit a linear regression to the data.

> pearsonLinearModel <- lm(pearsonDataMunged$sonHeight ~ pearsonDataMunged$fatherHeight);

# Add regressionline to scatter plot

> abline(pearsonLinearModel,

lty=1,

lwd=2,

col=’red’);

The linear regression line generated seems to fit nicely right in the middle of all the data points and there is also a symmetry about the line. With linear regression comes the notion of a residual which is the difference between the actual value and the fitted value. Residuals are used to help determine how well the linear fit is to the data. In figure(2) above it is the difference between the blue dot (actual value) and its corresponding point on the red line (best fit).

e. To check for a normal distribution in the data I will first make a plot of the residuals. If the data is normally distributed we would expect to see symmetry in the scatter plot. The next step will be to create a Quantile vs. Quantile (Q-Q) plot which will either validate or invalidate my assumption that the data conforms to a normal distribution.

> par(mfrow=c(1,2))

> plot(pearsonDataMunged$fatherHeight, resid(pearsonLinearModel),

pch=20,

col=”seagreen”,

ylab=”Residuals”,

xlab=”Father’s height (in)

Figure(3)”,

main=”Pearson Height Data (Residuals)”)

> abline(0, 0, col=”red”);

>qqnorm(pearsonLinearModel$residuals,

pch=20,

col=”seagreen”,

main=”Pearson Data: Q-Q Plot”,

xlab=”Theoretical Quantiles

Figure(4)”);

> qqline(pearsonLinearModel$residuals, col=”red”);

f. In figure(3) above we see that the residual plot has a well defined symmetry which again supports the idea that the data is normally distributed. The Q-Q plot in figure (4) shows that, except for the very ends, most of the data points lie on a straight line which indicates that the data has a normal distribution. The end points, as can be seen, deviate from the straight line. However the pattern (left side is under line, right side it above) indicates that the data, albeit normal, has a bit of fatness in the tails on either side of the histogram. It needs to be stated that for Q-Q plots it is not uncommon for the ends to deviate from the straight line. What is important is that an overwhelming majority of the data points do fall on the red line. Figure (5) below was first presented in Part 1 of this series but I want to bring it back to demonstrate that the data does in fact have some fatness in the tails.

> hist(pearsonDataMunged$sonHeight,

col=”blue”,

breaks=100,

xaxt=’n’,

ylab=”Frequency”,

xlab=”Pearson Son Height (in)

Figure (5)”,

main=”Histogram Pearson Son Height”);

> axis(side=1, at=seq(10,80,1), xlim=c(10, 80));

> meansonHeight <- mean(pearsonDataMunged$sonHeight);

> lines(rep(meansonHeight,100),seq(0,150,length=100),col=”red”,lwd=5);

**2. Data Analysis & Model Building**

In the previous section I applied some basic principles of exploratory analysis and concluded that:

– The data has the characteristics of normality. However fatness in the tails exist on either side of the distribution.

– Linear Regression by the method of Basic Least Squares will be used to model the relationship between a son’s height and his father’s height.

In section 1.d above, I calculated a linear regression against the Pearson data. I will now look at some of the details of that fit.

a. Apply the summary() function to get details of the fit

> summary(pearsonLinearModel);

Table (1) above shows a bit of information about the linear fit. However I will only be focusing on a few of them:

– Intercept = 34.05

– Slope = 0.51

– p-Value of intercept < 2e-16

– p-Value of slope < 2e-16

The intercept is the point at which the linear regression line (see figure (2)) crosses the y-axis (for x = 0). The slope is the numerical representation of the rate of change in y as a function of x:

The p-Value is a statistic that makes a statement regarding the likelihood that the null hypothesis is true.

For this analysis:

- The null hypothesis states: There **is no** relationship between a son’s height and his father’s height.

– The alternative hypothesis states : There **is** a relationship between a son’s height and his father’s height.

The p-Value is reported as a number and it is common to make the following statements about p-Values:

- P < 0.05 (Statistically significant) – Strong evidence against the null hypothesis. It is rejected

– P > 0.05 – Strong evidence for the null hypothesis. It is accepted

For this analysis P < 0.001. This implies that the relationship between a son’s height and his father’s height it not by chance. With this I reject and conclude that is valid.

Where:

= Estimated height of the son

= Y intercept of fitted line

= Estimated slope of fitted line

= Measured height of the father

= Is a random component often called the “error” or “disturbance” in an observation. It represents everything we did not measure or take into account. This disturbance can be seen in figure (2). The best line fit does not touch every point in the graph and it and is the difference between the point and the best line fit for any pair of data.

With calculated numerical values from Table(1) above

The generic model is expressed as

Finally, an important measure of the model is the confidence interval. What this tells us is that for a given sample of data what the range of values are that describe the uncertainty surrounding the estimate. The way it works is that the Pearson data set can be broken up into many subsets (random but equal size buckets of data) and for each subset of data a linear regression can be fit. From this you would see different slopes and intercepts estimated for each subset. The confidence interval says that, for a given confidence, say 95% the interval for the slope would lie between “a” and “b”, also for the intercept the interval would lie between “m” and “n”. This would hold true for 95% of the samples and we would expect to see that for 5% of the samples those would lie outside the confidence interval.

In R the confidence interval for a linear regression is calculated using the “confint()” command:

> confint(pearsonLinearModel,level=0.95);

**3. Interpretation & Scrutinization of Results**

In part 1 of this series (Section 3) the question was posed, “Is there a correlation between a father’s height and his son’s height. If so, can a model be built to represent the relationship between the two variables?” All of the analysis leading up to this point was performed to help answer those questions. The analysis performed here demonstrated that the Pearson data behaves as if it is normally distributed. This being the case, the data lends itself nicely to the method of Basic Least Squares to derive a Linear Regression. Given these facts, I was able to construct a model that best approximates the Linear fit to the data.

Based on the Linear Regression fit the following statement can be made regarding the Pearson data:

A one inch increase in father height is associated with a 0.51 inch increase in son height (95% Confidence Interval: 0.45 – 0.56 inches).

Finally, one aspect of this data that is not so obvious from the previous figures and tables is the relationship between father height and son height at various intervals. This phenomena was first observed and reported on by Sir Francis Galton in 1886. What Galton discovered was that for children of short parents the measured height of the child was, on average, greater than that of the parent. For children of tall parents, their height tended to be less than that of the parent. Galton in his famous 1886 paper labeled this phenomenon as “Regression toward Mediocory.” That label was the genesis of the term “Linear Regression” as applied by Galton. Figure (6) below demonstrates this phenomena with the Pearson data.

The R code for this plot is involved and as such I chose not to include it in the blog post. However you can see the code in its entirety by downloading it from section 4 below.

**4. Analysis Write-up**

I chose to perform the write-up as a stand-alone document instead of doing it here in the blog. This provides you the opportunity to see how a formal data analysis write-up could be formatted. There will be variations on how it will look as a function of the audience. Further, if you recall in Part 1 of 2 of this series I introduced you to the idea of creating a folder structure to support your data science projects. That folder structure accommodates the formal write-up and it is there that you will find my final write-up.

Download Complete Data Analysis

I would like to make some comments regarding the write-up. Great care was given to give credit where credit is due. Most of the work done in this analysis depends on work done by the pioneers Galton and Pearson. You will discover that as you perform your own analysis many credits will be given. This will be borne out of the fact that you will will constantly be referencing the work and methods of others. This is how we learn!

**5. Final Thoughts & Comments**

The work Galton and Pearson did in their lifetime is nothing short of astounding. The discovery of Linear Regression is attributed to Galton. The discipline of Mathematical Statistics is attributed to Pearson. These tools are used every day by people across many verticals and disciplines. The whole field of data science rests on the mighty shoulders statistics and analysis.

*Links to the original works by both men:*

Sir Francis Galton (Bio) – Regression Towards Mediocrity in Hereditary Stature

Karl Pearson (Bio) – Grammar of Science

*Other useful links:*

Regression Toward the Mean – Nice explanation

Predicting human height by Victorian and genomic methods – Interesting paper that compares the work done by Galton to modern genomics to explain variance in height.

The goal of this two part series was to demonstrate by example, how to perform a complete data analysis. The repeatable process presented will afford you a disciplined approach that will keep you on the path of (to quote my favorite fictional detective) “Method and Order.” This will facilitate a auditable process – a facet of data science that is very important. You always need to demonstrate how you got to the outcome and it needs to be repeatable!

*Congratulations! You just completed your first data analysis. *

Thank you for such a detailed and informative write-up. It is hard to find quality breakdowns of data analysis procedures like this on the internet.

Anthony, thank you for the kind words. The goal of DataTechBlog is to facilitate transition from “curious about data science” to “understanding data science.”

I believe that the best way to accomplish this is through meticulously detailed tutorials.

Regards, Louis.