Greenplum, R, Rstudio, and Data. The Basic Ingredients for Successful Recipes.

IngredientsIn the last three tutorials (Tutorial 1, Tutorial 2, Tutorial 3), I demonstrated how to create an infrastructure to support data science projects.  Next in the evolution is to show you how you can load data into Greenplum and R for analysis. For this tutorial I am using the famous Fisher Iris data set.  This data is most often used to demonstrate how discriminant analysis can be used to manifest obvious similarities and dissimilarities of objects, and in the case of the Fisher Iris data set, three species of Iris.  I chose this particular data because we will be using it in a tutorial in the near future.

Assumptions
– This part of the series assumes you are running Pivotal Greenplum Community Edition.  If you followed my tutorial on the installation and configuration of Pivotal Greenplum,  then you are good to go.  Also presumed is that you have R, and RStudio installed, configured, and integrated with Pivotal Greenplum.  If not, then see my tutorial here.

– All Linux O.S. commands preceded by “#” implies “run as root.”

– Be conscious that some commands wrap due to blog templating. An example of this is Step 1.a.  What appears to be two lines of text is actually one variable (PATH) assignment.

Index
1. Enable environment & install IDE GUI for Greenplum database: Steps 1 – 4
2. Greenplum – Create an external  and internal table of data: Step 5
3. Greenplum – Querying external and internal tables: Step 6
4. R – Connecting to and querying Greenplum tables (ext and int): Step 7
5. RStudio – Connecting to and querying Greenplum tables (ext and int): Step 8
6. RStudio – Read in CSV file of data: Step 9
7. RStudio – Manually input data: Step 10


Let’s Begin

Step 1. In my tutorial on installing R and RStudio I created a Linux O.S. user called “datasci1.” I will be using that user for this tutorial.  If you created another user, or you have an existing user you would like to use, you just need to adjust where appropriate.

a. Modify user’s .bash_profile file
Modify “PATH” variable:
PATH=$PATH:/opt/greenplum-db/ext/python/bin:$HOME/bin:/opt/greenplum-db/bin

b. Modify user’s .bashrc file:
Add: source /opt/greenplum-db/greenplum_path.sh

Step 2. Enable the Linux O.S. to accept remote connections to Greenplum.

# iptables -A INPUT -p tcp – -dport 5432 -j ACCEPT
# service iptables save
This will open the port that Greenplum is listening on.
** Note that there is a hyphen hyphen in front of “dport”. No space.

Step 3. Enable Greenplum to accept remote connections.

Add to pg_hba.conf the following two lines:

host all datauser 0.0.0.0/0 md5
host all gpadmin 0.0.0.0/0 md5

Load changes (run as O.S. user gpadmin):
$ pg_ctl -D /opt/greenplum-db/gpmaster/gpdata-1 reload

This will allow the users “datauser” and “gpadmin” to connect from any host so long as they supply the password. This is not a very secure method but because we are working on a local VM it should be OK.

Step 4. Installing a GUI tool to the Greenplum database will help make running SQL and DDL commands much easier. I recommend pgAdmin III.  It is open source and free to use.

a. Download pgAdmin III for Windows.
(Version 1.18.0 was the latest at the time of this install.)
This install is quite simple. Unzip the download and run the pgadmin3.msi file.

b. Once installed launch pgAdmin III.

c. Click on the “Power Cord” icon in the upper left hand corner. This is the server registration icon.

– In the Name box type “Greenplum”
– In the Host box type <IP of your VM>
– Take default value of “postgres” for Maintenance DB
– In the username box, type “datauser”. Recall we created this user in tutorial 2 of my infrastructure series.
– In the password box, type the password for “datauser”
– Click OK
– If you are prompted with a “Warning” saying that pgAdmin III does not support the version of Greenplum we have installed you can ignore it.
– If you followed tutorial 3 you should see a database named “sandbox” in the tree view of pgAdmin III

Step 5. Create an external and an internal Greenplum table

a. On the Linux VM as the “datasci1” user issue the following command:
$ gpfdist -d /home/datasci1 -p 8080 &

This will start the Greenplum gpfdist which is the parallel file distribution program used by Greenplum.

b. Now, using your favorite sftp client move the file “fisherdata” to the home directory of “datasci1” on the Linux VM.
– Give the file the permissions: rw-r-r:
$ chmod 644 /home/datasci1/fisherdata.csv

c. Log into pgAdmin III as the user “gpadmin” and issue the following command:
“alter role datauser CREATEEXTTABLE;”
– This will enable the database user “datauser” to be able to create external tables.

d. Log into pgAdmin III as the user “datauser” and navigate to the “sandbox” database and run the code in tut4_ddl1 against the “sandbox” database.

e. Next we will create an internal table in Greenplum based on the external table we just created.  Log into pgAdmin III as the user “datauser” (if you are not already there) and navigate to the “sandbox” database and run the code in tut4_ddl2 against the “sandbox” database.

Step 6. Greenplum – Querying external and internal tables

a. Log into Greenplum via the command line psql utility
$ psql -d sandbox -h Analytics1 -p 5432 -U datauser

b. At the psql prompt issue command to change to the “sandbox” database
sandbox=> \c sandbox

c. Issue command to get row count from fisherdata_external
sandbox=> select count(1) from public.fisherdata_external;

d. Issue command to get row count from fisherdata_internal
sandbox=> select count(1) from public.fisherdata_internal;

You will notice that both tables produce the same row count.  Logically the two tables are the same,  but physically they are not.

Step 7.  R – Connecting to and querying Greenplum external and internal tables

a. Log into the Linux VM as “datasci1” and bring up a console.  Issue the “R” command. This will launch the R app.
$ R
R version 3.0.1 (2013-05-16) — “Good Sport”
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
>

b. Install RODBC package
> install.packages(“RODBC”);
You may be prompted to pick a mirror site from which to download.

c. Once the package is installed you need to load the libraries
-> library(RODBC);
You may be prompted to pick a mirror site from which to download.

d. Create connection to Greenplum database
> con<-odbcConnect(“Greenplum”);

This step requires that you followed tutorial 3.
You may be prompted to pick a mirror site from which to download.

e. Query database
> sqlQuery(con,”select version();”);
version
1 PostgreSQL 8.2.15 (Greenplum Database 4.2.2.4

f. Run queries against Greenplum through R
> sqlQuery(con,”select count(1) from public.fisherdata_internal;”);
count
150
> sqlQuery(con,”select count(1) from public.fisherdata_external;”);
count
150

The two queries produce identical output as was expected.

Step 8. RStudio – Connecting to and querying Greenplum external and internal tables
In tutorial 3 I configured RStudio to integrate with Greenplum. If you skipped that tutorial please review it to ensure your environment is properly configured.

a. Launch URL for RStudio in your favorite web browser

b. Create connection to Greenplum database
> con<-odbcConnect(“Greenplum”);

c. Run queries against Greenplum through RStudio
> sqlQuery(con,”select count(1) from public.fisherdata_internal;”);
count
150
> sqlQuery(con,”select count(1) from public.fisherdata_external;”);
count
150

The two queries produce the same result as expected.

Step 9. RStudio – Read in *.csv file from Linux file system

a. Launch URL for RStudio in your favorite web browser

b. Verify that RSTUDIO is working from the correct directory on the Linux file system
> getwd();
[1] “/home/datasci1”
I am logged in as the user “datasci1” and this is the home directory for that user.

c. List the contents of the home directory for “datasci1” user to ensure the fisherdata.csv file is present.
> list.files(“/home/datasci1”)
[1] “fisherdata.csv”
The file “fisherdata.csv” is present

d. Read fisherdata.csv into R matrix
> fisherdata <- read.csv(file=”fisherdata.csv”,head=FALSE,sep=”,”);

e. Verify contents of table
> head(fisherdata, n= 5)
This should produce the first 5 rows of the fisherdata.csv file

Step 10. RStudio – Manually enter data into R

a. Launch URL for RStudio in your favorite web browser

b. Type in the command for the creation of a 2×4 matrix with data
> mymatrix = matrix(+ c(4,6,8,7,4,6,4,1), nrow=2, ncol=4, byrow=TRUE);
>

c. Issue the command to see the data you just typed in
> mymatrix;
The output should be 2×4 matrix
[,1] [,2] [,3] [,4]
[1,]     4     6     8      7
[2,]     4     6     4      1

Congratulations!  You just learned how to load data into Greenplum and R for analysis.  Further, you now know how to accomplish this task via several methods, thus giving yourself latitude on how you approach and handle data.

Louis V. Frolio

Leave a Comment

Filed under Foundations, Tutorials

Leave a Reply

Your email address will not be published. Required fields are marked *