Building an Infrastructure to Support Data Science Projects (Part 1 of 3) – Creating a Virtualized Environment.


As with any project or experiment, infrastructure has to be in place to support the intended work. For a data science project, the obvious first step is the computing environment. Simply stated, you can’t do advanced analytics on large data sets without CPU, RAM and disk. With these items as your foundation, much can be designed, engineered and built. Before we can walk through a data science project, we first need hardware and software in place. For the purposes of the tutorials here on DataTechBlog, a PC or laptop with adequate CPU, RAM and disk will suffice. Further, it is my plan to use only open or free software and code for all tutorials, so a reasonably spec’d computer is all you need to accomplish what we will do here.

This tutorial walks you through the installation of VMware Player and CentOS 6.x (optimized for Pivotal Greenplum). This lays the foundation for the next steps, which will include the installation of Pivotal Greenplum, the MADlib libraries, R, and RStudio. When this environment is complete, you will be able to perform many types of “in database” analysis using SQL with MADlib, analysis using R with Greenplum, and analysis with R against flat files or manually entered data.

Let’s Begin

Step 1.  Download and install VMware Player. This should take only 5-10 minutes. Because we are installing a 64-bit O.S., you need to verify that your CPU supports 64-bit virtualization. You can check for that support here. If your CPU does support virtualization, you also need to check that it is enabled in your BIOS. Here is a great YouTube video that talks you through the proper BIOS configuration to support 64-bit virtualization.
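If your host machine happens to run Linux, one quick way to see whether the CPU advertises hardware virtualization is to look for the vmx (Intel VT-x) or svm (AMD-V) flag in /proc/cpuinfo. What follows is only a sketch under that assumption; the function name cpu_virt_flag is my own and is not part of any tool mentioned here. A flag showing up does not prove the feature is switched on in the BIOS, so the BIOS check still applies.

```shell
#!/bin/sh
# Report which hardware virtualization flag the CPU advertises.
# Assumption: a Linux host that lists CPU flags in /proc/cpuinfo.
# The optional first argument is an alternative file to read.
cpu_virt_flag() {
    cpuinfo="${1:-/proc/cpuinfo}"
    if grep -qw vmx "$cpuinfo"; then
        echo "vmx"     # Intel VT-x
    elif grep -qw svm "$cpuinfo"; then
        echo "svm"     # AMD-V
    else
        echo "none"    # no virtualization flag found
    fi
}
```

Calling `cpu_virt_flag` with no arguments reads /proc/cpuinfo on the host. A result of "none" suggests the CPU (or the environment you ran it in) does not expose hardware virtualization.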
Step 2.  Download and stage the CentOS ISO images (at the time of this post, 6.4 was the latest). There are many places from which you can download these files; I found mine here. The two files you want are CentOS-6.x-x86_64-bin-DVD1.iso and CentOS-6.x-x86_64-bin-DVD2.iso, where “6.x” is the latest release.
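Since these are multi-gigabyte downloads, it can be worth confirming that each ISO arrived intact before you boot from it. The sketch below compares a file's SHA-256 digest against an expected value. It assumes the sha256sum utility is available where you run it and that your download mirror publishes a checksum for each image; the helper name verify_iso is hypothetical.

```shell
#!/bin/sh
# Succeed (exit status 0) only if the SHA-256 digest of a file matches
# the expected hex string. Assumes the coreutils sha256sum utility.
verify_iso() {
    iso="$1"
    expected="$2"
    # sha256sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha256sum "$iso" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}
```

A call would look like `verify_iso CentOS-6.x-x86_64-bin-DVD1.iso <digest from your mirror> && echo OK`, where the digest is whatever your mirror lists for that file.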
Step 3.  Launch VMWare Player and choose “Create New Virtual Machine.”
Step 4.   At “New Virtual Machine Wizard” choose “I will install the operating system later.”
Click “Next”
Step 5.  At the “Select a Guest Operating System” screen choose “Linux”, version “CentOS 64-bit”.
Click “Next”
Step 6.  At the “Name the Virtual Machine” screen give the VM a name and a location. Ensure that you have adequate disk space at the location of the VM files.
Click “Next”
Step 7.  At the “Specify Disk Capacity” screen you can take the defaults. This is good enough for our work.
Click “Next”
Step 8.  Review the settings then click on “Finish”.   You now have a VM Shell that is ready for the O.S.
Click “Next”
Step 9.  Right click on your new VM and select “Virtual Machine Settings”
Click “Next”
Step 10.  Click on “CD/DVD (IDE)” in the window, then “Use an ISO Image file:”. Browse to and choose CentOS-6.x-x86_64-bin-DVD1.iso. Take the rest of the defaults, and ensure that the “Network Adapter” setting is “NAT”.
Click “OK”
Step 11.  Back on the Home screen choose “Play virtual machine”.  At the “Welcome to CentOS 6.x!” screen choose  “Install or upgrade an existing system”.
Click “Next”
Step 12.   At the “Test Media” screen choose “Skip”.
Click “Next”
Step 13.  Click on “Next” at the CentOS 6 screen.  At the Language screen pick your language.
Click “Next”
Step 14.  Choose your keyboard setup at the keyboard screen.
Click “Next”
Step 15.  Choose “Basic Storage Devices”.
Click “Next”
Step 16.   At the “Storage Device Warning” screen choose “Yes, discard any data”.
Step 17.  Provide a hostname for your VM.
Click “Next”
Step 18.  Pick a time zone.
Click “Next”
Step 19.  Provide a password for the “root” user.
Click “Next”
Step 20.  At the “Which type of installation would you like” choose “Create Custom Layout”
Click “Next”
Step 21.  At the disk partition utility click on “Free”,
                      then click on “Create”,
                       then choose “Standard Partition”,
                       then click on “Create”
                       For file system type choose “swap”
                       For Size choose 1000 MB
Click “OK”
Step 22.  At the disk partition utility click on “Free”,
                       then click on “Create”,
                        then choose “Standard Partition”,
                        then click on “Create”
                        For file system type choose “ext3”
                        For Mount Point choose “/boot”
                        For Size choose 250 MB
Click “OK”
Step 23.  At the disk partition utility click on “Free”,
                       then click on “Create”,
                        then choose “Standard Partition”,
                        then click on “Create”
                        For file system type choose “ext3”
                        For Mount Point choose “/”
                        For Size choose 6000 MB
Click “OK”
Step 24.  At the disk partition utility click on “Free”,
                        then click on “Create”,
                         then choose “Standard Partition”,
                         then click on “Create”
                         For file system type choose “xfs”
                         For Mount Point choose “/opt”
                         For Size go to the “Additional Size Options” pane and choose “Fill to
                         maximum allowable size”.  This tells the installer to use all remaining
                         space for this mount point.
                         ** Make sure that you choose “xfs” as the file system type for the
                          /opt mount point. This is crucial for the installation of Pivotal Greenplum.
Click “Next”
Step 25.  At the “Format Warnings” screen choose “Format”
Choose “Write Changes to disk”
Click “Next”
Step 26.  At the Boot Loader Screen take defaults.
Click “Next”
Step 27.   At the O.S. install screen choose “Basic Server”.
                        Choose “Customize Now” at the bottom of the screen.
Click “Next”
Step 28.  Choose the following add-ins:
                       Applications => Internet Browser
                       Desktops => General Purpose Desktop
                       Desktops => Graphical Administration Tools
                       Desktops => X Window System
                       Servers => System Administration Tools
Click “Next”
Step 29.  When complete you will be prompted to “Reboot”.
Click on “Reboot”
Step 30.  When the system comes back up you will be asked if you would like to make changes to the system configuration.
                       Choose “Firewall Configuration”: Disable Firewall.
                       Highlight “OK” then “Enter”
                       Tab to “Quit” then “Enter”
Step 31.  At the “login” prompt log in as “root” with the password from Step 19 above. If the desktop did not launch, issue the command “startx” at the command prompt: # startx
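Once you are logged in, it may be worth confirming that the partitions from Steps 21 through 24 came out as intended, in particular that /opt really is on xfs. Here is a minimal sketch that reads the kernel's mount table; it assumes /proc/mounts is available (it is on a standard Linux install), and the helper name fs_type_of is hypothetical.

```shell
#!/bin/sh
# Print the file system type of a mount point, as recorded in a mount
# table with the /proc/mounts layout: device, mount point, type, ...
# The optional second argument is an alternative mount table file.
fs_type_of() {
    mount_point="$1"
    mounts="${2:-/proc/mounts}"
    # If a mount point is listed more than once, the last entry wins.
    awk -v mp="$mount_point" '$2 == mp { fs = $3 } END { print fs }' "$mounts"
}
```

On the new VM, `fs_type_of /opt` should print xfs, while `fs_type_of /boot` and `fs_type_of /` should print ext3 if you followed the layout above.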
Step 32.  Once the desktop is active, right click and choose “Open in Terminal”, then issue the “route” command: # route. This will return several items that we will use to enable networking on the VM. You need to look for and make note of the following:
                         Default Gateway: In my case it was
                         Network Mask: In my case it was
** In my case 192.168.107.* is the basis for the network I am configuring. Your VM may produce a different basis, but the pattern (as shown here) will be the same in your environment.
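If you would rather not pick these values out by eye, the sketch below pulls them from the routing table. It assumes the numeric form of the command, `route -n`, whose columns are Destination, Gateway, Genmask, Flags and so on. The helper name parse_route is my own, and it skips the 169.254.* link-local route if one is present.

```shell
#!/bin/sh
# Read `route -n` output on stdin and print the default gateway and the
# netmask of the local network.
parse_route() {
    awk '
        # Default route: destination 0.0.0.0 with the G (gateway) flag.
        $1 == "0.0.0.0" && $4 ~ /G/ { gw = $2 }
        # Local network: no gateway, a real netmask, not link-local.
        $2 == "0.0.0.0" && $3 != "0.0.0.0" && $1 !~ /^169\.254\./ && mask == "" { mask = $3 }
        END { print "GATEWAY=" gw; print "NETMASK=" mask }
    '
}
```

Run it as `route -n | parse_route` and note the two values it prints for the configuration step that follows.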
Step 33.  I will now edit the ifcfg-eth0 file, but first I will make a backup copy:
#  cp /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth0.orig
                         Configure /etc/sysconfig/network-scripts/ifcfg-eth0 as such:
                          a.) Leave HWADDR as is
                          b.) DEVICE=eth0
                          c.) BOOTPROTO=static
                          d.) ONBOOT=yes
                          e.) IPADDR=
                          f.) NETMASK=
                          g.) BROADCAST=
                          h.) GATEWAY=
                          i.) DNS1=
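The BROADCAST value does not have to be looked up separately, because it follows from IPADDR and NETMASK: each broadcast octet is the address octet OR'd with the inverted netmask octet. Here is a small sketch of that arithmetic in plain sh. The function name broadcast_of and the sample address used in the note below are hypothetical; substitute the values from your own VM.

```shell
#!/bin/sh
# Compute the IPv4 broadcast address for an address and a netmask,
# both given as dotted quads.
broadcast_of() {
    old_ifs=$IFS
    IFS=.
    # Split both dotted quads into eight positional parameters:
    # $1..$4 are the address octets, $5..$8 the netmask octets.
    set -- $1 $2
    IFS=$old_ifs
    # 255 - m is the inverted netmask octet; OR it with the address octet.
    echo "$(( $1 | (255 - $5) )).$(( $2 | (255 - $6) )).$(( $3 | (255 - $7) )).$(( $4 | (255 - $8) ))"
}
```

For example, `broadcast_of 192.168.107.128 255.255.255.0` prints 192.168.107.255, which would be the BROADCAST entry for a VM at that (made-up) address.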
Step 34.  Restart the network services for the change to take effect.
                       # service network restart
Step 35.  Test that the networking is working as expected.
                       # ping
                        You should see successful ping results.


Congratulations, you now have a virtualized environment which will be the foundation for the next part of this series:

(Part 2 of 3) – Installing Pivotal Greenplum with MADlib.

Louis V. Frolio
