Building an Infrastructure to Support Data Science Projects (Part 3 of 3) – Installing and Configuring R / RStudio with Pivotal Greenplum Integration

In this third and final part of the series (following Part 1 of 3 and Part 2 of 3), I walk you through the installation and configuration of R and RStudio, and demonstrate how R integrates with Pivotal Greenplum. For those of you who don’t know R, the R project website has a lot of useful information. In short, R is a scripting language and runtime environment for statistical analysis of data, simple or complex. It is available for free under the GNU General Public License. RStudio is a free and open source IDE for R; the RStudio website has more information about it.

Assumptions
This part of the series assumes you are running Pivotal Greenplum Community Edition.  If you followed Part 2 of 3 of this series then you are good to go.

Legend
- Any Linux O.S. command preceded by “#” means “run as root.”
- Any command preceded by “>” is typed at the R console.
- Be aware that some commands wrap due to blog templating. Step 1 is an example: what appears to be two lines of text is actually one command (run as root) that has wrapped.

Index
1. Install R: Steps 1 – 6
2. Install RStudio: Steps 7 – 10
3. Install unixODBC: Step 11
4. Install postgresql-odbc.x86_64: Step 12
5. Install and Configure RODBC and odbc.ini, then test the connection: Steps 13 – 16

 

Let’s Begin

Step 1. Log in as “root” on the VM we created and built in Part 1 and Part 2 of this series. Download and install EPEL (“Extra Packages for Enterprise Linux”), which is required to support R.

- Depending on how you did your install, this may already be present

# rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm

- If you get a “404 Not Found” error then you need to look in the repository using your web browser to see if they posted a new release.
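The “may already be present” check can be scripted. What follows is a minimal sketch, assuming the installed package is named epel-release (as the RPM file name above suggests). The helper name ensure_epel and the EPEL_URL variable are hypothetical, introduced only for this illustration.

```shell
# Hypothetical helper: install the EPEL release package only if it is missing.
# Assumes the installed package is named "epel-release".
EPEL_URL="http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm"

ensure_epel() {
    # "rpm -q" exits with status 0 when the package is installed
    if rpm -q epel-release >/dev/null 2>&1; then
        echo "EPEL is already installed"
    else
        rpm -Uvh "$EPEL_URL"
    fi
}

# Usage (as root):
#   ensure_epel
```

If the download fails with a 404, update EPEL_URL to the release currently posted in the repository before calling the function again.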

Step 2. Install Tcl/Tk (pronounced “tickle” and “tee-kay”). Tcl (Tool Command Language) consists of a language and a set of libraries and is needed for R. Tk is an extension to Tcl and provides an interface to X11. You can read more about Tcl/Tk on the Tcl/Tk website.

# yum install tcl

- You will be prompted a few times: “Is this ok [y/N]:” Type y then Enter

# yum clean all

- Do a little house cleaning of yum’s cache

Step 3.  Install R

# yum install R

- Package R.x86_64 0:3.0.1-2.el6 will be installed (this was the version at the time of my install)
- This will take a while. There are lots of rpm’s to install
- You will be prompted a few times: “Is this ok [y/N]:” Type y then Enter
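If you are scripting the build rather than typing along, the prompts in Steps 2 and 3 can be skipped. This is a small sketch, not part of the original walkthrough: yum's -y option answers “yes” to every “Is this ok [y/N]:” question. The function name install_r_stack is hypothetical.

```shell
# Hypothetical helper: install the packages from Steps 2 and 3 without prompts.
# yum's -y option assumes "yes" for every "Is this ok [y/N]:" question.
install_r_stack() {
    yum -y install tcl R
}

# Usage (as root):
#   install_r_stack
```

Answering the prompts by hand, as in the steps above, has the advantage that you get to see the package list before anything is installed.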

Step 4.  Make sure to add /usr/bin to your .bash_profile if it is not there already
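One way to make this check without editing the file by eye is sketched below. The function names path_contains and ensure_in_profile are hypothetical; the sketch only appends a line when the directory is missing from the current PATH.

```shell
# Return success if directory $1 is already an entry in the
# colon-separated list $2 (for example the value of $PATH).
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Append an export line to profile file $2 only when directory $1
# is missing from the current PATH.
ensure_in_profile() {
    dir=$1
    profile=$2
    if ! path_contains "$dir" "$PATH"; then
        # Single quotes keep $PATH literal so it is expanded at login time
        printf 'export PATH="$PATH:%s"\n' "$dir" >> "$profile"
    fi
}

# Usage:
#   ensure_in_profile /usr/bin "$HOME/.bash_profile"
```

Wrapping the list in colons before matching avoids false hits on directories that merely share a prefix, such as /usr/bin versus /usr/binx.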

Step 5.  Invoke R at the command line

# R

E.g.) [Screenshot: the R command prompt]

Step 6.  Quit R

> q() [ENTER]

- You will be prompted with “Save workspace image? [y/n/c]:”. Choose “n” then [ENTER]

Step 7. Download RStudio RPM

# wget http://download2.rstudio.org/rstudio-server-0.98.501-x86_64.rpm

- This was the latest release at time of install.

Step 8. Install RStudio RPM package.

# yum install --nogpgcheck rstudio-server-0.98.501-x86_64.rpm

(That is two hyphens in front of nogpgcheck)
- You will be prompted: “Is this OK [y/n]:” Type y then ENTER

Step 9. RStudio authenticates through the Linux server, so we need to create an O.S. user to enable us to use RStudio.

# useradd datasci1 -g users

# passwd datasci1

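- You will be prompted with “New password:” and then “Retype new password:”. These are prompts, not commands. Enter the same password both times.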

Step 10. You can now log into RStudio via a web browser from the machine hosting the VM.

- Launch your browser of choice

- Type in the URL: http://<IP of your Greenplum VM>:8787/

- Provide the credentials from Step 9 above

[Screenshot: the RStudio IDE in the browser]

You can quit RStudio by clicking on “Sign Out” in the upper right hand corner.
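If the login page does not load, it can help to confirm from a terminal that something is answering on port 8787 before suspecting the browser. The sketch below assumes curl is installed on the machine running the check. The function names and the example IP address are hypothetical.

```shell
# Build the RStudio Server URL for a given host (port 8787, as in Step 10).
rstudio_url() {
    printf 'http://%s:8787/\n' "$1"
}

# Hypothetical check: succeed if anything answers HTTP at that URL.
# Assumes curl is installed; --max-time stops the check after 5 seconds.
rstudio_reachable() {
    curl -s -o /dev/null --max-time 5 "$(rstudio_url "$1")"
}

# Usage (192.168.1.50 is a made-up address; use the IP of your Greenplum VM):
#   rstudio_reachable 192.168.1.50 && echo "RStudio Server is answering"
```

A failure here points at the server or the network (service not running, firewall, wrong IP) rather than at the credentials.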

Step 11. Install unixODBC (Back to Linux VM console)

# yum install  unixODBC  unixODBC-devel  libtool-ltdl  libtool-ltdl-devel

- You will be prompted: “Is this ok [y/n]:” Type y then ENTER

Step 12. Install postgresql-odbc

# yum install postgresql-odbc.x86_64

- You will be prompted: “Is this ok [y/n]:” Type y then ENTER
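Before moving on, you may want to confirm that unixODBC and the PostgreSQL driver landed where the odbc.ini file in Step 15 expects them. This is a sketch under assumptions: odbcinst is the configuration utility that ships with unixODBC, and the default driver path is the one used in Step 15. The function names are hypothetical.

```shell
# Succeed if the driver library exists at path $1.
driver_present() {
    [ -f "$1" ]
}

# Hypothetical check for Steps 11 and 12.
# $1 is the driver path; defaults to the path used in the odbc.ini of Step 15.
check_odbc() {
    driver=${1:-/usr/lib64/psqlodbc.so}
    # "odbcinst -j" prints the unixODBC version and its config file locations
    odbcinst -j || return 1
    if driver_present "$driver"; then
        echo "driver found: $driver"
    else
        echo "driver missing: $driver" >&2
        return 1
    fi
}

# Usage:
#   check_odbc
```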

Step 13. Log into RStudio as in Step 10 above

- At the console, issue the install.packages command for the RODBC package

> install.packages("RODBC")

- If successful, the output ends with “* DONE (RODBC)”

- In the lower right-hand pane you will see several tabs. Click on the “Packages” tab, scroll down until you see “RODBC”, and check the box next to it. This loads the package into the current session, the same as running library(RODBC) at the console.

> q() [ENTER]

Or, click on “Sign Out” on the top right.

Step 14. We now need to log back into the VM as root and configure the odbc.ini file

- Create the file /etc/odbc.ini

# touch /etc/odbc.ini

# chmod 644 /etc/odbc.ini

Step 15. Open the odbc.ini file for editing and add the following parameters

[Data Sources]
Greenplum = Database description
[ODBC]
InstallDir = /usr/lib64
[Greenplum]
Description = for ODBC access to a Greenplum database named below
Driver = /usr/lib64/psqlodbc.so
Trace = No
TraceFile = /tmp/odbc.log
Database = sandbox
Servername = localhost
Username = datauser
Password = password
Port = 5432
Protocol = 8.4.2
ReadOnly = Yes
RowVersioning = No
ShowSystemTable = No
ShowOidColumn = No
FakeOidIndex = No
ConnSettings =

- Take note that the database user referenced in the odbc.ini file is “datauser”. This is the user we created in Part 2 of this series. This user has the credentials that allow it to connect to the Greenplum database.
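To double-check what a DSN in odbc.ini actually points at, and to try the connection before involving R, something like the following sketch can be used. It assumes section headers sit alone on their line with no trailing spaces, as in the file above. The function names ini_get and test_dsn are hypothetical; isql is the command-line client that ships with unixODBC, and the -v (verbose errors) and -b (batch) options are used here on the assumption that your unixODBC build provides them.

```shell
# Print the value of key $2 inside section [$1] of ini file $3.
ini_get() {
    awk -v section="[$1]" -v key="$2" '
        # Track which [section] we are currently inside
        /^\[/ { in_section = ($0 == section); next }
        in_section {
            split($0, kv, "=")
            k = kv[1]; gsub(/^[ \t]+|[ \t]+$/, "", k)
            if (k == key) {
                # Everything after the first "=" is the value
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                print v
                exit
            }
        }
    ' "$3"
}

# Hypothetical smoke test: send one query to DSN $1 through isql.
test_dsn() {
    echo "select version();" | isql -v -b "$1"
}

# Usage:
#   ini_get Greenplum Servername /etc/odbc.ini
#   ini_get Greenplum Port /etc/odbc.ini
#   test_dsn Greenplum
```

If test_dsn fails here, the problem lies in the ODBC configuration or the database (server name, port, pg_hba.conf) and not in R or RODBC.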

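Step 16. Test the connection to the Greenplum database from R

- Log into RStudio as in Step 10 above and issue the following commands. The first one loads RODBC in case it is not already loaded in this session.

> library(RODBC)

> con <- odbcConnect("Greenplum")

> sqlQuery(con, "select version();")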

- You just connected to the Greenplum database and asked it for its version. It should return something similar to this:
[Screenshot: RStudio console showing the Greenplum version string]
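The same connection test can be run from the Linux shell without opening RStudio. This is a sketch, assuming Rscript is on the PATH and RODBC is installed for the user running it. The function name gp_version_via_r and the GP_DSN environment variable are hypothetical; the R calls (library, odbcConnect, sqlQuery, odbcClose, Sys.getenv) are the standard R and RODBC functions.

```shell
# Hypothetical helper: run the Step 16 check non-interactively for DSN $1.
# The DSN name is handed to R through the GP_DSN environment variable.
gp_version_via_r() {
    GP_DSN="$1" Rscript \
        -e 'library(RODBC)' \
        -e 'con <- odbcConnect(Sys.getenv("GP_DSN"))' \
        -e 'print(sqlQuery(con, "select version();"))' \
        -e 'odbcClose(con)'
}

# Usage (as the O.S. user that has RODBC installed, e.g. datasci1):
#   gp_version_via_r Greenplum
```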

Congratulations, you just installed R and RStudio and integrated them with Pivotal Greenplum. With this series complete, you now have an infrastructure to support data science projects.

Louis V. Frolio

 

 


17 Responses to Building an Infrastructure to Support Data Science Projects (Part 3 of 3) – Installing and Configuring R / RStudio with Pivotal Greenplum Integration

  1. Alaa Elamin

    Hello,

    I’m working on preparing the Big Data Lab course in the German
    University in Cairo, Egypt. Our lab contains 4 servers, 1 VNXe Storage
    & 26 PCs. Students use VMware vSphere Client to connect to the servers
    in order to work on their VMs. Optimizing the disk size used is our
    goal. Instead of creating multiple virtual machines with RStudio &
    Greenplum installed on each for each student, we need to have ONLY 1
    virtual machine with the database and all other virtual machines for
    students have only Rstudio. Then they connect remotely to the database
    on this VM. How can we do this remote connection to the Greenplum
    database on two different VMs (VMs are on the same network of
    course).

    Thanks in advance,
    Alaa Elamin

  2. joseph kambourakis

    In the Data Science class, the VMs are all set up with their own databases, and this helps reduce problems. Having all the machines connect to one database is much more difficult. I believe it is possible if one were to change the odbc.ini files located in /etc and set the IP to that of the one database. My knowledge of ODBC drivers is pretty limited, but that’s where I would start. Try getting in touch with Hisham Arafat in Cairo; he’s the Greenplum/data science instructor there.

  3. Alaa, I have not built out an infrastructure like the one you mention. However, if I were to take on that task I would approach it as such:

    1.) Create the single instance of Greenplum with all the users defined in the database: user01, user02, userXX, etc. Each user would have access to their own schema: userschema01, userschema02, userschemaXX, etc. Each schema would have all the necessary objects (data tables, MADlib, etc.).
    2.) On the VMs you would have your users defined: osuser01, osuser02, osuserXX, etc.
    3.) On the VMs you would create profiles in the odbc.ini file to accommodate each unique O.S. user.
    4.) Make the appropriate adjustments to RStudio on the VMs.

    With this setup each user would be able to connect to RStudio (via URL) and log in with their O.S. credentials. They would then be directed to the proper profile in the odbc.ini file, and this would direct them to the correct schema in the database.

    This would be my approach, keep in mind I have not done this type of setup so you may encounter bumps in the road. However, I believe that this is the correct path to take.

    Please keep us posted on your success.
    Regards, Louis.

  4. Alaa Elamin

    Thank you very much for the help. I’m going to try this procedure and update you with any success.

    Best Regards,
    Alaa Elamin.

  5. Alaa Elamin

    Dear Louis,

    I’m not experienced with ODBC, so I cannot do the right configuration for odbc.ini files.
    I configured the odbc.ini file for a remote VM as you did in (Part 3 of 3 Step 15) except for the username and password parameters. Then I signed in to RStudio with its own credentials, but I’m unable to connect to the remote database. The odbcConnect("Greenplum") command gives me this error:

    —————————————————————————————————————-
    > con<-odbcConnect("Greenplum");
    Warning messages:
    1: In odbcDriverConnect("DSN=Greenplum") :
    [RODBC] ERROR: state 08001, code 101, message [unixODBC]Could not connect to the server;
    Connection refused [127.0.0.1:5433]
    2: In odbcDriverConnect("DSN=Greenplum") : ODBC connection failed
    —————————————————————————————————————-

    So I would be grateful if you tell me how to solve this issue and how to create profiles in the odbc.ini file to accommodate each unique O.S. user!

    Regards,
    Alaa Elamin

  6. Alaa, from the error it seems that you may have permission issues with the Greenplum database and the database user. I would first try to connect to the database with the db user you have set up. Recall, you need to configure the pg_hba.conf file (see Part 2 of the series) to allow the database user to connect.

  7. Alaa Elamin

    Louis, Thank you very much for your help. Finally, it is successful to remotely connect to Greenplum.

  8. Have you ever thought about adding a little bit more than just your articles?
    I mean, what you say is valuable and everything. However think
    about if you added some great graphics or
    videos to give your posts more, “pop”! Your content is excellent
    but with pics and clips, this website could undeniably be one
    of the greatest in its field. Good blog!

  9. Stan, thank you for the kind words. The blog is now in its sixth month of existence and it is getting great traction.
    I am currently working through a new tutorial (Hadoop Cluster setup with Hive/Pig) and I think it lends itself nicely to a screencast.
    I am still not sure that I will screencast this tutorial; however, stay tuned to see what comes.

    Cheers, Louis.

  10. Versicherungsvergleiche

    Keep on writing, great job!

  11. Brandon Rogers

    Hi Louis,

    I am having difficulty installing the RODBC package in R (Step 13). Below is the output I am getting from R. It appears to download the package, but doesn’t finish installing. I did some web searching on installing packages manually, but ran into some other complications trying to follow that path.

    Can you provide any guidance?

    > install.packages(“RODBC”)
    Installing package into ‘/home/datasci1/R/x86_64-redhat-linux-gnu-library/3.1’
    (as ‘lib’ is unspecified)
    trying URL ‘http://cran.rstudio.com/src/contrib/RODBC_1.3-10.tar.gz’
    Content type ‘application/x-gzip’ length 1157263 bytes (1.1 Mb)
    opened URL
    ==================================================
    downloaded 1.1 Mb

    The downloaded source packages are in
    ‘/tmp/RtmpRXBL27/downloaded_packages’
    >

  12. Brandon, if you are using RStudio click on the “Packages” tab on the lower right pane. Do you see that “RODBC” is part of the “User” library?

  13. Brandon Rogers

    Hi Louis. No, “RODBC” is not listed on the Packages tab in RStudio.

  14. Brandon, I have not seen this happen before. However, my first suspicion is the permissions on the Linux account you are using to log in with RStudio. Double check this aspect of the installation process.

  15. Brandon Rogers

    Hi Louis, I am logging in to RStudio using the ‘datasci1’ account, per the instructions in Steps 9 and 10. What permissions would I be looking for?

  16. Brandon, I wanted to make sure you followed the steps carefully. If you missed nothing then you are in uncharted waters. Try finding the detail log for the install. Also, verify that the version of R is compatible with the version of RODBC. This is all open software and it is kinda wild wild west.

  17. Brandon Rogers

    Hi Louis, I was able to get the RODBC package installed via the following:
    1. I moved the ‘RODBC_1.3-10.tar.gz’ archive to a directory I created, /data/Rpkgs
    2. From the VM terminal as root, I ran the following command:

    # R CMD INSTALL /data/Rpkgs/RODBC_1.3-10.tar.gz

    This successfully installed the RODBC package and I can see and select it in RStudio.
