Building an Infrastructure to Support Data Science Projects (Part 2 of 3) – Installing Greenplum with MADlib

In the first part of this series (Part 1 of 3) we installed and configured CentOS on a virtual machine. This laid the foundation and prepared the environment in which we will now install Pivotal Greenplum Community Edition. Pivotal’s license model allows this edition to be used for any purpose on a single node. As part of this tutorial I will also demonstrate how to install the open-source MADlib libraries into Greenplum. MADlib provides a rich set of libraries for advanced in-database data analysis and mining, which can be called via regular SQL. Installing Greenplum and MADlib will facilitate some of the data science exercises I will be demonstrating in the near future.

Filed under Infrastructure, Tutorials

Building an Infrastructure to Support Data Science Projects (Part 1 of 3) – Creating a Virtualized Environment.

As with any project or experiment, infrastructure has to be in place to support the intended work. In the case of a data science project, the obvious first step is the computing environment. Simply stated, you can’t do advanced analytics on large data sets without CPU, RAM and disk. With these items as your foundation, much can be designed, engineered and built. Before we can walk through a data science project, we first need to have hardware and software in place. For the purposes of the tutorials here on DataTechBlog, a PC or laptop with adequate CPU, RAM and disk will suffice. Further, it is my plan to use only open or free software and code for all tutorials. You need only a reasonably spec’d computer to accomplish all that we will do here. This tutorial will walk you through the installation of VMware Player and CentOS 6.x (optimized for Pivotal Greenplum). This lays the foundation for the next steps, which will include the installation of Pivotal Greenplum, the MADlib libraries, R, and RStudio. When this environment is complete, you will be able to perform many types of “in-database” analysis using SQL with MADlib, analysis using R with Greenplum, and analysis with R against flat files or manually entered data.

Filed under Infrastructure, Tutorials

A Great Introduction to Data Science and Big Data.

For those of you foraying into the world of Big Data (BD) and Data Science (DS), it can be challenging to find a single resource that paints a meaningful high-level picture of what this stuff is all about. Personally, I always like to start with a 30,000-foot view of the challenge or endeavor before me. I find that it helps frame the important concepts, which makes the details that follow easier to consume and digest. This tiered approach is especially important to the disciplines of BD and DS. A book I read in less than 45 minutes completely satisfied my 30,000-foot criteria. The key to this book’s success is the organic progression of each chapter, the breadth of topics introduced and its overall brevity. The authors (in a mere 65 pages) walk you through a summary of data science, a working definition of big data, the new technologies necessitated by big data, aspects of the data analytics lifecycle, key characteristics of a data scientist and approaches to effective communication as a data scientist. If you have an interest in DS or BD, get your hands on this book. It provides a simple overview of the complicated disciplines of data science and big data.

Louis V. Frolio

Filed under Suggested Readings

The Inaugural Post.

Welcome to DataTechBlog. My name is Louis and I am a data professional. I espouse all data: big, small, structured, semi-structured, unstructured, dark, sensor; I do not discriminate. Over the past 20 years I have gained expertise in many aspects of data, including analytics, management, operations, architecture, technology, administration, and engineering.
Over the past several years the terms “data science” and “big data” have become commonplace. My goal is to help other data and database professionals learn about the emerging disciplines of data science and big data analytics. Here you will find tutorials, how-tos and topic discussions on various dimensions of these disciplines, including data mining, exploratory data analysis, data prep/scrubbing, data engineering, tools (e.g. Greenplum, R, MADlib, Hadoop, Hive, Pig), visualizations, and much more.
Coming from a traditional data architecture background, I can help bridge the gap for people who work with RDBMS technologies who are interested in learning more about data science and big data analytics.

Regards, Louis.

Filed under Home