The Application of Analytics in Healthcare

The application of analytics in healthcare has been transforming over the past five to six years. Prior to this transformation, analytics applied to patient data were mostly descriptive in nature; the simple reports generated by healthcare providers only told the story of “what happened.” In this era of big data, more and more healthcare organizations are looking to take advantage of their data in a more meaningful way. Their goal is to extract business-relevant information that enables providers, managers, and executives to derive actionable insight from their data. Recently, I had the pleasure of researching this topic for a graduate class. I feel strongly that we are seeing a paradigm shift in how providers and payers look at their data (both structured and unstructured). This research addresses the key issues facing the healthcare industry today as well as in the future.

Filed under Education & Instruction

Operationalizing a Hadoop Eco-System (Part 2: Customizing Map Reduce)


It gives me great pleasure to introduce a new contributor to DataTechBlog. Ms. Neha Sharma makes her debut with this blog post. Neha is a talented software engineer and big data enthusiast. In this post, she demonstrates how to enhance the “word count” MapReduce job that ships with Hadoop. The enhancements include the removal of “stop” words, an option for case insensitivity, and the removal of punctuation.

In part 1 of this series, you were shown how to install and configure a Hadoop cluster. Here, you will be shown how to modify a MapReduce job; in this case, the job to be modified is the word count example that ships with Hadoop.
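
To give a rough feel for the enhancements described above, here is a minimal sketch of the cleaning logic a mapper might apply, written in plain Python rather than the Hadoop Java API used in the post; the stop-word list is an invented placeholder, not the one from the tutorial.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; the post's actual list may differ.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def clean_tokens(line, case_insensitive=True):
    """Tokenize a line, stripping punctuation and optionally folding case."""
    if case_insensitive:
        line = line.lower()
    # Drop anything that is not a word character or whitespace (punctuation).
    line = re.sub(r"[^\w\s]", "", line)
    return [w for w in line.split() if w not in STOP_WORDS]

def word_count(lines):
    """Mapper and reducer collapsed into one in-process step for illustration."""
    counts = Counter()
    for line in lines:
        counts.update(clean_tokens(line))
    return counts

counts = word_count(["To be, or not to be: that is the question."])
# "to", "is", and "the" are filtered out; "be" survives twice.
```

In a real Hadoop job, the same cleaning would live inside the map method, with the framework handling the grouping and summation across nodes.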

Filed under Advanced Topics, Big Data

Modern Data Architecture: The Data Lake

Today, one of the hottest topics out there is “big data.”  It seems that everybody is talking about it, and more and more companies are throwing their hats into the big data ring.  These are exciting times because there is a fundamental shift in how we think about data.  Not that long ago, structured data reigned supreme.  For data architects, the methods of handling data (transactional and dimensional) were based in sound theory thanks to E.F. Codd (relational modeling), Bill Inmon (top-down 3NF design), Ralph Kimball (dimensional modeling), and Daniel Linstedt (Data Vault architecture).  We are now living in a post-relational world where the majority of the data being generated (estimates have it at 80%) is either semi-structured, quasi-structured, or unstructured (1).  Further, this data is growing at a rapid rate.  As of 2012, digital content is being created at a rate of 2.5 quintillion (1 with 18 trailing zeros) bytes each day! (2)  Moreover, between 2009 and 2020 we can expect to see a 44-fold increase in all digital content, and of this data only 5% will be classified as structured (3).  So, with all those impressive stats, the question staring us in the face is this: “How do we manage this deluge of unstructured data, and how do we get it to play nice with structured data?”  Enter the Data Lake!
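
One way to picture how a data lake gets unstructured and structured data to play nice is schema-on-read: raw records land in the lake untouched, and structure is applied only at read time. Here is a minimal Python sketch of that idea; the record fields are invented for illustration.

```python
import json

# Raw, semi-structured events are stored exactly as they arrived;
# notice the two records do not share the same fields.
raw_events = [
    '{"id": 1, "source": "sensor", "temp": 21.5}',
    '{"id": 2, "source": "weblog", "url": "/home"}',
]

def read_with_schema(raw, fields):
    """Apply a schema only at read time; fields a record lacks become None."""
    for line in raw:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

rows = list(read_with_schema(raw_events, ["id", "source", "temp", "url"]))
```

Contrast this with a relational warehouse, where the schema must be designed up front and every record forced into it before loading.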

Filed under Big Data

Operationalizing a Hadoop Eco-System (Part 1: Installing & Configuring a 3-node Cluster)

The objective of DataTechBlog is to bring the many facets of data, data tools, and the theory of data to those curious about data science and big data.  The relationship between these disciplines and data can be complex.  However, if careful consideration is given to a tutorial, it is a practical expectation that the layman can be brought online quickly.  With that said, I am extremely excited to bring you this tutorial on the Hadoop eco-system.  Hadoop and MapReduce (at a high level) are not complicated ideas.  Basically, you take a large volume of data and spread it across many servers (HDFS).  Once at rest, the data can be acted upon by the many CPUs in the cluster (MapReduce).  What makes this so cool is that the traditional approach to processing data (bring the data to the CPU) is flipped: with MapReduce, the CPU is brought to the data.  This “divide-and-conquer” approach makes Hadoop and MapReduce indispensable when processing massive volumes of data.  In part 1 of this multi-part series, I am going to demonstrate how to install, configure, and run a 3-node Hadoop cluster.  At the end, I will run a simple MapReduce job to perform a unique word count of Shakespeare’s Hamlet.  Future installments of this series will cover topics such as: 1. creating an advanced word count with MapReduce, 2. installing and running Hive, 3. installing and running Pig, and 4. using Sqoop to extract and import structured data into HDFS.  The goal is to illuminate all the popular and useful tools that support Hadoop.
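
The divide-and-conquer flow can be sketched in miniature: split the input into chunks (standing in for HDFS blocks), map each chunk independently, then shuffle and reduce. This is a single-process Python simulation of the idea, not the actual Hadoop job run in the tutorial.

```python
from collections import defaultdict

def mapper(chunk):
    # Map phase: emit a (word, 1) pair for every word in this chunk.
    return [(w.lower(), 1) for w in chunk.split()]

def shuffle(mapped):
    # Shuffle phase: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: sum the ones emitted for each word.
    return key, sum(values)

# Split the input across "nodes", map each chunk, then shuffle and reduce.
chunks = ["to be or not", "to be that is", "the question"]
mapped = [mapper(c) for c in chunks]
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
```

On a real cluster, each call to `mapper` would run on the node already holding that block of data, which is exactly the “bring the CPU to the data” point made above.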

Filed under Big Data, Infrastructure, Tutorials

Training for the Aspiring Data Scientist: My Experience with a MOOC.

In the inaugural post of DataTechBlog, I stated a goal of helping others learn more about the emerging field of data science and big data analytics.  It was my original intent to tackle this lofty goal primarily via instructive tutorials.  However, as of late, I have come to realize just how important formal instruction is to the learning process.  Prior to my professional career in data, I spent the better part of 12 years in college earning various degrees and taking many (extra) classes.  The education process afforded me an opportunity to saturate myself in various topics.  In the milieu of the classroom setting, many opportunities abounded, and I, having been a motivated student, was able to run with the proverbial ball.  My point here is that I (having been out of college for a while) forgot just how important classroom learning is in fostering expertise in a particular subject or discipline.  As such, I signed up for and successfully completed my first MOOC (Massive Open Online Course).

Filed under Education & Instruction

Structuring a Data Analysis using R (Part 2 of 2) – Analyzing, Modeling, and the Write-up.

In the first part (Structuring a Data Analysis using R (Part 1 of 2)) of this two-part series, I discussed several key aspects necessary to any successful data analysis project. In that post, I also began a prototypical data analysis project, working my way up through munging of the data. All those steps in Part 1 enabled me to begin the analysis and modeling parts of the project.  This post picks up and continues with the data analysis, culminating in a formal write-up of the analysis demonstrated here.

Filed under Foundations, Tutorials

DataTechBlog’s New Look and Feel

I am pleased to present the new look and feel of DataTechBlog. I felt the blog needed to speak a bit more to my personality and taste.  I am thrilled with the work done by a great UX designer. James Brown is outstanding at his craft – a tireless perfectionist with an eye for utility, system values, and user-centered design. He worked with me and my whimsy for weeks as I put him through iteration after iteration of the DataTechBlog branding. The result, as can be seen above in the masthead, is an expression of the abstraction of pure data.  Also new to the blog is the implementation of Google Web Fonts. This, along with a multitude of small changes throughout the site, culminates in a synergy that helps drive home the vision of DataTechBlog.

James can be reached here.

Louis V. Frolio

Filed under Home, Main

Structuring a Data Analysis using R (Part 1 of 2) – Gathering, Organizing, Exploring and Munging Data.

Up until this point, all instructional posts have been tutorials on setting up an infrastructure and readying an environment for data science projects.  Over the next two tutorials, I am going to walk you through a complete data analysis project.  You will be shown the steps necessary to ensure a consistent and repeatable process that can be used for all your data analysis projects.  Simply put, this tutorial’s goal is to create a framework and provide a set of tools that can support any data science project.
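
To give a feel for what such a repeatable framework looks like, here is a toy sketch of a gather → explore → munge pipeline. The tutorial itself works in R; this Python version, with invented function names and stub data, only illustrates the shape of the process.

```python
def gather(source):
    """Acquire raw records; a stub returning in-memory data for illustration."""
    return [
        {"sepal": "5.1", "species": "setosa"},
        {"sepal": None, "species": "virginica"},
    ]

def explore(records):
    """Quick profiling: row count and missing values per field of interest."""
    missing = sum(1 for r in records if r["sepal"] is None)
    return {"rows": len(records), "missing_sepal": missing}

def munge(records):
    """Clean: drop incomplete rows and cast numeric fields to floats."""
    return [{"sepal": float(r["sepal"]), "species": r["species"]}
            for r in records if r["sepal"] is not None]

# Each stage feeds the next, so the whole analysis can be re-run end to end.
raw = gather("iris.csv")
profile = explore(raw)
clean = munge(raw)
```

The point is not the specific functions but the discipline: every stage is an explicit, re-runnable step, so the analysis can be repeated from raw data at any time.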

Filed under Foundations, Tutorials

Greenplum, R, Rstudio, and Data. The Basic Ingredients for Successful Recipes.

In the last three tutorials (Tutorial 1, Tutorial 2, Tutorial 3), I demonstrated how to create an infrastructure to support data science projects.  Next in the evolution is to show you how to load data into Greenplum and R for analysis. For this tutorial, I am using the famous Fisher iris data set.  This data is most often used to demonstrate how discriminant analysis can manifest obvious similarities and dissimilarities of objects; in the case of the Fisher iris data set, three species of iris.  I chose this particular data set because we will use it in a tutorial in the near future.
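
As a small taste of the separability the iris data exhibits, here is a nearest-centroid classifier in Python, a simplified stand-in for discriminant analysis, over a few rows copied from the Fisher iris data (sepal length and petal length). The actual tutorial works in Greenplum and R; this sketch only shows why the three species separate so cleanly.

```python
# A few rows from the Fisher iris data: (sepal length, petal length) in cm.
samples = {
    "setosa":     [(5.1, 1.4), (4.9, 1.4), (4.7, 1.3)],
    "versicolor": [(7.0, 4.7), (6.4, 4.5), (6.9, 4.9)],
    "virginica":  [(6.3, 6.0), (5.8, 5.1), (7.1, 5.9)],
}

def centroid(points):
    """Mean position of a set of points, dimension by dimension."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def classify(point, centroids):
    """Assign a point to the species whose centroid is nearest (squared distance)."""
    return min(centroids,
               key=lambda s: sum((a - b) ** 2 for a, b in zip(point, centroids[s])))

centroids = {s: centroid(pts) for s, pts in samples.items()}
label = classify((5.0, 1.5), centroids)  # lands near the setosa cluster
```

Because the species centroids sit far apart relative to the spread within each species, even this crude rule separates them well, which is the property discriminant analysis exploits more rigorously.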

Filed under Foundations, Tutorials

Beyond the Basics – Data Science for Business (Foster Provost & Tom Fawcett).

In an earlier post, I recommended a short read on introductory data science and big data. That book gave a fantastic overview of the major areas and ideas governing these disciplines. However, if you are motivated to dig deeper and wrap your head around the details, then Data Science for Business should be your next read. This book does a fantastic job of helping the reader understand how one should think when considering data science as a profession, or when simply wanting to understand all the hype. Further, much detail and time is given to the “data analytics lifecycle,” which governs data science projects through process and a framework.  The authors meticulously step through the various modeling techniques with solid examples and explanations.  There are sections of the book that detail some math and its derivations, which may prove to be a challenge if your math is rusty; however, this should not present too much of an obstacle to understanding the gist of what is being conveyed.  In the preface, the authors state that the book is intended for business people who work with data scientists, manage data scientists, or seek to understand the value in data science, as well as for developers implementing data science solutions and, finally, aspiring data scientists.  I believe this book has a role to play in one’s data science education and that it is an appropriate read for those wishing to understand, in detail, how data science is done and what it aims to achieve.


Louis V. Frolio

Filed under Suggested Readings