Tag Archives: Data Science

Statistics in Decision Making


Statistics & Business Decision Making

The role of statistics in business can be traced back hundreds of years. As early as 1188 AD, Gerald of Wales used statistics to complete the first population census of Wales (1). It wasn’t long before merchants realized that statistics could be used to measure and quantify trade. The first record of this comes from Florence, in Giovanni Villani’s “Nuova Cronica” of 1346 (1). Statistical methods were later adopted to help drive quality, and in doing so contributed to the advancement of statistics itself. In 1908, William Sealy Gosset, chief brewer for Guinness in Dublin, devised the t-test (2) to measure consistency between batches of stout (1).


Filed under Decisioning, Education & Instruction, Main

Structuring a Data Analysis using R (Part 2 of 2) – Analyzing, Modeling, and the Write-up.


In the first part of this two-part series (Structuring a Data Analysis using R (Part 1 of 2)), I discussed several key aspects necessary to any successful data analysis project. In that post I also began a prototypical data analysis project, working my way up through the munging of the data. The steps in Part 1 enabled me to begin the analysis and modeling parts of the project. This post picks up where that one left off and continues the data analysis, culminating in a formal write-up of the analysis demonstrated here.


Filed under Foundations, Tutorials

Structuring a Data Analysis using R (Part 1 of 2) – Gathering, Organizing, Exploring and Munging Data.

Over the next two tutorials, I am going to walk you through a complete data analysis project. You will be shown the steps necessary to ensure a consistent and repeatable process that can be used for all your data analysis projects. Simply put, this tutorial’s goal is to create a framework and provide a set of tools that can be used to support any data science project.


Filed under Foundations, Tutorials

Training for the Aspiring Data Scientist: My Experience with a MOOC.

In the inaugural post of DataTechBlog, I stated a goal of helping others learn more about the emerging field of data science and big data analytics. It was my original intent to tackle this lofty goal primarily via instructive tutorials. However, I have lately come to realize just how important formal instruction is to the learning process. Prior to my professional career in data, I spent the better part of 12 years in college earning various degrees and taking many (extra) classes. The education process afforded me an opportunity to saturate myself in various topics. In the milieu of the classroom, opportunities abounded and I, having been a motivated student, was able to run with the proverbial ball. My point here is that I (having been out of college for a while) forgot just how important classroom learning is to fostering expertise in a particular subject or discipline. As such, I signed up for and successfully completed my first MOOC (Massive Open Online Course).


Filed under Education & Instruction

Greenplum, R, RStudio, and Data. The Basic Ingredients for Successful Recipes.

In the last three tutorials (Tutorial 1, Tutorial 2, Tutorial 3), I demonstrated how to create an infrastructure to support data science projects. The next step is to show you how to load data into Greenplum and R for analysis. For this tutorial I am using the famous Fisher Iris data set. This data is most often used to demonstrate how discriminant analysis can reveal similarities and dissimilarities among objects, which in the case of the Fisher Iris data set are three species of Iris. I chose this particular data because we will be using it in a tutorial in the near future.


Filed under Foundations, Tutorials

Beyond the Basics – Data Science for Business (Foster Provost & Tom Fawcett).

In an earlier post, I recommended a short read on introductory data science and big data. That book gave a fantastic overview of the major areas and ideas governing these disciplines. However, if you are motivated to dig deeper and wrap your head around the details, then Data Science for Business should be your next read. This book does a fantastic job of helping readers understand how they should think if they are considering data science as a profession, or if they simply want to understand all the hype. Further, much detail and time is given to the idea of the “Data Analytics Lifecycle”, which governs data science projects through a process and a framework. The authors meticulously step through the various modeling techniques with solid examples and explanations. Some sections of the book detail the math and its derivations, which may prove to be a challenge if your math is rusty. However, this should not present too much of an obstacle to understanding the gist of what is being conveyed. In the preface, the authors state that the book is intended for business people who are working with data scientists, managing data scientists, or seeking to understand the value in data science. The book is also suited to developers implementing data science solutions and, finally, to aspiring data scientists. I believe that this book has a role to play in one’s education in data science and that it is an appropriate read for those wishing to understand, in detail, how data science is done and what it aims to achieve.


Louis V. Frolio


Filed under Suggested Readings

Building an Infrastructure to Support Data Science Projects (Part 2 of 3) – Installing Greenplum with MADlib

In the first part of this series (Part 1 of 3) we installed and configured CentOS on a virtual machine. This laid the foundation and made ready an environment that will now be used to install Pivotal Greenplum Community Edition. This edition allows for any use on a single node per Pivotal’s license model. As part of this tutorial I will also demonstrate how to install the open-source MADlib libraries into Greenplum. MADlib provides a rich set of libraries for advanced in-database data analysis and mining, which can be called via regular SQL. The installation of Greenplum and MADlib will facilitate some of the data science exercises I will be demonstrating in the near future.


Filed under Infrastructure, Tutorials

Building an Infrastructure to Support Data Science Projects (Part 1 of 3) – Creating a Virtualized Environment.


As with any project or experiment, infrastructure has to be in place to support the intended work. In the case of a data science project, the obvious first step is the computing environment. Simply stated, you can’t do advanced analytics on large data sets without CPU, RAM and disk. With these items as your foundation, much can be designed, engineered and built. Before we can walk through a data science project, we first need to have hardware and software in place. For the purposes of the tutorials here on DataTechBlog, a PC or laptop with adequate CPU, RAM and disk will suffice. Further, it is my plan to use only open or free software and code for all tutorials. You need only a reasonably spec’d computer to accomplish all that we will do here. This tutorial will walk you through the installation of VMware Player and CentOS 6.x (optimized for Pivotal Greenplum). This lays the foundation for the next steps, which will include the installation of Pivotal Greenplum, the MADlib libraries, R, and RStudio. When this environment is complete, you will be able to perform many types of “in database” analysis using SQL with MADlib, analysis using R with Greenplum, and analysis with R against flat files or manually entered data.


Filed under Infrastructure, Tutorials

A Great Introduction to Data Science and Big Data.

For those of you foraying into the world of Big Data (BD) and Data Science (DS), it can be challenging to find a single resource that paints a meaningful high-level picture of what this stuff is all about. Personally, I always like to start with a 30,000-foot view of the challenge or endeavor before me. I find that it helps frame the important concepts, which makes it easier to consume and digest the details that follow. This tiered approach is especially important to the disciplines of BD and DS. A book I read in less than 45 minutes completely satisfied my 30,000-foot criterion. The key to this book’s success is the organic progression of each chapter, the breadth of topics introduced and its overall brevity. The authors (in a mere 65 pages) walk you through a summary of data science, a working definition of big data, the new technologies necessitated by big data, aspects of the data analytics lifecycle, key characteristics of a data scientist and approaches to effective communication as a data scientist. If you have an interest in DS or BD, get your hands on this book. It provides a simple overview of the complicated disciplines of data science and big data.

Louis V. Frolio


Filed under Suggested Readings

The Inaugural Post.


Welcome to DataTechBlog. My name is Louis and I am a data professional. I espouse all data: big, small, structured, semi-structured, unstructured, dark, sensor; I do not discriminate. Over the past 20 years I have gained expertise in many aspects of data, including analytics, management, operations, architecture, technology, administration, and engineering.
Over the past several years the terms “data science” and “big data” have become commonplace. My goal is to help other data and database professionals learn about the emerging disciplines of data science and big data analytics. Here you will find tutorials, how-tos and topic discussions on various dimensions of these disciplines, including data mining, exploratory data analysis, data prep/scrubbing, data engineering, tools (e.g. Greenplum, R, MADlib, Hadoop, Hive, Pig), visualizations, and much more.
Coming from a traditional data architecture background, I can help bridge the gap for people who work with RDBMS technologies and are interested in learning more about data science and big data analytics.

Regards, Louis.


Filed under Home