In the last three tutorials (Tutorial 1, Tutorial 2, Tutorial 3), I demonstrated how to create an infrastructure to support data science projects. The next step is to show you how to load data into Greenplum and R for analysis. For this tutorial I am using the famous Fisher Iris data set. This data set is most often used to demonstrate how discriminant analysis can reveal similarities and differences among groups of objects, in this case three species of Iris. I chose this particular data set because we will be using it in a tutorial in the near future.
Tag Archives: Greenplum
Building an Infrastructure to Support Data Science Projects (Part 3 of 3) – Installing and Configuring R / RStudio with Pivotal Greenplum Integration
In this third and final part of the series (Part 1 of 3, Part 2 of 3), I walk you through the installation and configuration of R and RStudio. I also demonstrate how R integrates with Pivotal Greenplum. For those of you who don’t know what R is, you can go here for a lot of useful information. In short, R is a scripting language and runtime environment used for statistical analysis of data, whether simple or complex. R is available for free under the GNU General Public License. RStudio is a free and open-source IDE for R. You can go here for more information about RStudio.
Building an Infrastructure to Support Data Science Projects (Part 2 of 3) – Installing Greenplum with MADlib
In the first part of this series (Part 1 of 3) we installed and configured CentOS on a virtual machine. This laid the foundation and prepared an environment that we will now use to install Pivotal Greenplum Community Edition. This edition allows for any use on a single node per Pivotal’s license model. As part of this tutorial I will also demonstrate how to install the open-source MADlib libraries into Greenplum. MADlib provides a rich set of libraries for advanced in-database data analysis and mining, which can be called via regular SQL. Installing Greenplum and MADlib will support some of the data science exercises I will be demonstrating in the near future.
Building an Infrastructure to Support Data Science Projects (Part 1 of 3) – Creating a Virtualized Environment