Structuring a Data Analysis using R (Part 2 of 2) – Analyzing, Modeling, and the Write-up.


In the first part of this two-part series (Structuring a Data Analysis using R (Part 1 of 2)), I discussed several key aspects necessary to any successful data analysis project. In that post I also began a prototypical data analysis project, working my way up through the munging of the data. The steps in Part 1 enabled me to begin the analysis and modeling parts of the project.  This post picks up from there and continues the data analysis, culminating in a formal write-up of the analysis demonstrated here.


Filed under Foundations, Tutorials

Structuring a Data Analysis using R (Part 1 of 2) – Gathering, Organizing, Exploring and Munging Data.

Over the next two tutorials, I am going to walk you through a complete data analysis project.  You will be shown the proper steps necessary to ensure a consistent and repeatable process that can be used for all your data analysis projects.  Simply put, this tutorial’s goal is to create a framework and provide a set of tools that can be used to support any data science project.


Filed under Foundations, Tutorials

Data Warehouse ETL Offload with Hadoop.


Data volumes are growing at an exponential rate causing problems for traditional IT infrastructures.  As a result, we are seeing more and more organizations taking advantage of emerging technologies, like Hadoop, to help mitigate the pressure of exploding data volumes.  Hadoop and its eco-system of tools play an important role in tackling tough problems that are plaguing traditional IT and data warehouse environments.

Specifically, Extract, Transform, and Load (ETL) can be offloaded to Hadoop to address the problem of exploding data volumes that are breaking traditional IT and ETL processes.  Using screencasts and animated video, I will demonstrate how Hadoop can be used to offload the most taxing ETL workloads.

First, I will demonstrate the overarching problem of ETL overload in a fictitious company called Acme Sales.  From there, I move on to actual demonstrations (screencasts) of an ETL offload using Hadoop, Hive, and Sqoop.

Before we begin I would like to introduce Noelle Dattilo, the newest guest author on DataTechBlog.  Noelle is an education expert who specializes in animation technology.  Noelle gets full credit for creating all the animations you are about to see in this post.  I asked Noelle to explain her process for creating compelling animations.

“These series of animations are created by an on-line program called VideoScribe, a white board animation tool.  These animations leverage problem-based learning to paint a conceptual picture in a form that is easily digestible to a wide audience.  To create the animations, I found some interesting graphics, turned them into SVG files (that’s the tricky part), uploaded them into VideoScribe and placed them in the order to be drawn.  Once I uploaded the audio track that Louis recorded, I synced the timing for each animation to be drawn, with the track, and voilà, we have our Acme Sales animations.”


Filed under Big Data Use Cases, Tutorials

Data Lakes for Big Data: A Free Online Course.


Back in January of 2014, I wrote a post describing my first MOOC experience. Also, during this period of time I shared my insights on a new construct called a Data Lake.  The data lake concept has evolved substantially since I first reported on it back in February of 2014. In fact, the ideas around Big Data are in a constant state of flux.  This stuff is evolving at the speed of now!

Fast forward to today, and I find myself in a unique position (shaking my head in disbelief) to be part of a team within EMC developing and offering a MOOC called “Data Lakes for Big Data.” As a member of the Big Data Solutions team, I support the training and education portfolio for EMC’s go-to-market strategy for Big Data across EMC’s federation of companies.

As one of the MOOC’s instructors, I can provide you a bit more insight into the course but first I want each and every one of you to sign up right here:  Data Lakes for Big Data. The class is now open and is being delivered asynchronously, meaning you can consume the material at your convenience.

The overarching goal of this MOOC is to take a person who has no familiarity with Big Data, Data Science, and Data Lakes and give them a basic foundation of knowledge from which they can grow. The MOOC is broken up into four one-week sessions, with each week introducing a new topic:

  • Week 1: What is Big Data and Data Science?
  • Week 2: What is the Value of Big Data and Big Data Analytics?
  • Week 3: What is a Data Lake?
  • Week 4: How is a Data Lake Operationalized?

This online course is for newbies; there are no prerequisites beyond a genuine interest in Big Data and a willingness to learn. In this course you will see videos from today’s top thought leaders speaking on Big Data and Data Science, including EMC’s Big Data Solutions CTO Chris Harrold, EMC’s Data Science guru David Dietrich, and none other than the Dean of Big Data himself, EMC’s Bill Schmarzo!

All those who finish the MOOC (with a passing grade of 70 or above)  will receive a certificate of completion.

I look forward to seeing you in the course.

Regards, Louis.


Filed under Education & Instruction

Data Driven – Creating a Data Culture (DJ Patil, Hilary Mason)

DJ Patil (of RelateIQ and LinkedIn fame) and Hilary Mason (of Fast Forward Labs and Bitly fame) recently released a free eBook that speaks to the merits of designing and enabling a data driven culture within an organization.

“A data driven organization acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.”

These two thought leaders in the big data and data science space bring together years of experience and deep knowledge to offer their views on:


  • What a data scientist should be
  • What defines a data driven organization and what they do well
  • The importance of democratizing data
  • Designing a data driven organization and how to manage research
  • Being the change agent for data success

The authors also stress that gut instinct has a key role to play in a data driven organization:

“One word of caution: don’t follow the data blindly. Being data driven
doesn’t mean ignoring your gut instinct. This is what we call letting
the data drive you off a cliff.”

I highly recommend that you take the time to read this brief but pithy eBook.

Get your copy of the eBook here: Data Driven – Creating a Data Culture.

You will be asked for your name and email prior to download.

Regards, Louis.


Filed under Suggested Readings

My New Year’s Resolution Impetus: Two Great Reads


While reading two books over the holiday break, I was inspired to demand more of myself and in turn, lay the foundation for my New Year’s resolution.

So Good They Can’t Ignore You (Cal Newport)

5 Elements of Effective Thinking (E. Burger, M. Starbird)

These two books, each in their own way, identify and describe core characteristics of people who are not only highly successful, but happy as well.  These people don’t label their work as a job nor as a career.  They would tell you that their work is a calling.

What jumped out immediately while reading these books was that each identified “deep knowledge” as a fundamental underpinning necessary for career and personal success.  Most, if not all, “true” experts of a discipline take the time to master their subject. They aren’t afraid of failure; in fact, they welcome it.  Failure provides so much useful feedback that it actually can be used as a guide for success.  C.S. Lewis said it best: “Failures are finger posts on the road to achievement.”

Deep Knowledge

“Be your own Socrates” is a driving principle in 5 Elements of Effective Thinking.  You should never stop asking questions and always be critical of your own thought process.  In my professional career, I have bumped into too many folks who simply want to be right, to have all the answers, and to be the smartest people in the room (big yawn!).  Fortunately, I have also encountered people who care only about deep knowledge and seeking truth.  Deep knowledge demands incessant questioning, meaningful and relevant dialogue, and the ability to put your ego in your own back pocket for the sake of unearthing truth.

Aspiring to deep knowledge has the important side effect of building rare and valuable skills.  This is a fundamental premise of So Good They Can’t Ignore You.  The author proffers the idea that success is not achieved by pursuing your passion, but rather results from going all in and striving to be a true craftsman.  The more deeply you learn your craft, the better you become at it, and this is what leads to true passion.  There is a great line in the book, “Following your passion is flawed, and can be harmful – leading to frequent job/career changes and anxiety/angst.”  I have been guilty of chasing something I thought was a passion only to find it was incompatible with me, or that it entailed a profound focus and energy outlay for which I was unprepared.

My takeaway from these books echoes a phrase we have heard our entire lives from our parents, friends, and wise elders: “If you are going to do something, do it well.”  Easy to say, much harder to do!

My New Year’s resolution is to strive for excellence in my professional endeavors.  I will not rest on my laurels, and most importantly, the proverbial foot shall not be removed from the accelerator!

Cheers and Happy New Year!

Louis.


Filed under Suggested Readings

The History and Use of R



Recently I attended a great lecture on the statistical programming language R. Titled “The History and Use of R,” this talk was held at HackReduce in Cambridge, Massachusetts and was sponsored by MediaMath.  The lecturer, Joe Kambourakis, is a colleague of mine and is the lead Data Science instructor at EMC Educational Services.  Joe is also a talented Data Scientist.

He did a great job of putting together the genesis and evolution of what is one of the hottest programming languages for statistics and graphics today.  If you are a practitioner of R, then I encourage you to check out this presentation.


Filed under Education & Instruction

Operationalizing a Hadoop Eco-System (Part 3: Installing and using Hive)




In part 1 of this series, I demonstrated how to install, configure, and run a three-node Hadoop cluster. In part 2, you were shown how to take the default “word count” YARN job that comes with Hadoop 2.2.0 and make it better.  In this leg of the journey, I will demonstrate how to install and run Hive.  Hive is a tool that sits atop Hadoop and lets you run YARN (next-generation MapReduce) jobs without having to write Java code.  With Hive and its scripting language, HiveQL, querying data across HDFS is made simple.  HiveQL is a SQL-like scripting language that gives those with SQL knowledge immediate access to data in HDFS.  HiveQL also lets you reference custom MapReduce scripts right in HiveQL queries.
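
To give a feel for why a SQL-like language is attractive here, the sketch below expresses a word count as a single declarative query instead of hand-written map and reduce code. It is only an illustration: it uses Python's built-in sqlite3 module with an in-memory table as a stand-in for Hive, and the table name `words`, the column `word`, and the sample rows are all made up for this example. A HiveQL query of the same general shape would run as a job on the cluster rather than in a local database.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a Hive table.
# Hive itself is not involved; the table, column, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [("hadoop",), ("hive",), ("hadoop",), ("yarn",), ("hive",), ("hadoop",)],
)

# A declarative word count: group identical words and count each group.
query = """
SELECT word, COUNT(*) AS occurrences
FROM words
GROUP BY word
ORDER BY occurrences DESC, word ASC
"""
rows = conn.execute(query).fetchall()
conn.close()

for word, occurrences in rows:
    print(word, occurrences)
```

The point is the shape of the query: the grouping and counting are stated once, and the engine decides how to execute them.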


Filed under Big Data, Infrastructure, Tutorials

The Application of Analytics in Healthcare


The application of analytics in healthcare has been undergoing a transformation over the past five to six years.  Prior to this transformation, analytics applied to patient data were mostly descriptive in nature.  That is to say, the simple reports generated by healthcare providers were basic and only told the story of “what happened.”  In this era of big data, more and more healthcare organizations are looking to take advantage of their data in a more meaningful way.  Their goal is to extract business-relevant information that enables providers, managers, and executives to derive actionable insight from their data.  Recently, I had the pleasure of researching this topic for a graduate class I took.  I feel strongly that we are seeing a paradigm shift in how providers and payers are looking at their data (both structured and unstructured).  This research addresses the key issues facing the healthcare industry today as well as in the future.


Filed under Education & Instruction

Operationalizing a Hadoop Eco-System (Part 2: Customizing Map Reduce)



It gives me great pleasure to introduce a new contributor to DataTechBlog.  Ms. Neha Sharma makes her debut with this blog post.  Neha is a talented software engineer and big data enthusiast.  In this post, she will be demonstrating how to enhance the “word count” MapReduce job that ships with Hadoop.  The enhancements will include the removal of “stop” words, the option for case insensitivity, and the removal of punctuation.

In part 1 of this series you were shown how to install and configure a Hadoop cluster.  Here you will be shown how to modify a MapReduce job. In this case the job to be modified is the word count example that ships with Hadoop.
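
To make the three enhancements concrete, here is a minimal sketch of the same logic in plain Python. It is not the Hadoop job from the post, which is a Java MapReduce program; it only mirrors the map and reduce steps in a single process. The stop-word list and the function names `map_words`, `reduce_counts`, and `word_count` are assumptions made for this illustration.

```python
import string
from collections import Counter

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

# Translation table that turns every ASCII punctuation character into a space.
# Note: this splits contractions, so "don't" becomes "don" and "t".
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def map_words(line, case_sensitive=False, stop_words=STOP_WORDS):
    """Map step: yield (word, 1) for each non-stop word in one line."""
    cleaned = line.translate(PUNCTUATION_TO_SPACE)
    for token in cleaned.split():
        word = token if case_sensitive else token.lower()
        # Stop words are matched in lower case in both modes.
        if word.lower() in stop_words:
            continue
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_count(lines, case_sensitive=False):
    """Run the map step over every line, then reduce the results."""
    pairs = (
        pair
        for line in lines
        for pair in map_words(line, case_sensitive=case_sensitive)
    )
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = ["The quick fox.", "the QUICK dog!"]
    print(word_count(sample))
    print(word_count(sample, case_sensitive=True))
```

In a real MapReduce job the map and reduce functions run on different nodes and the framework handles the grouping in between; the sketch keeps only the per-word logic.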


Filed under Big Data