Building an Infrastructure to Support Data Science Projects (Part 2 of 3) – Installing Greenplum with MADlib

Installing Greenplum

In the first part of this series (Part 1 of 3) we installed and configured CentOS on a virtual machine. That laid the foundation and readied an environment that will now be used to install Pivotal Greenplum Community Edition. This edition allows any use on a single node under Pivotal’s license model. As part of this tutorial I will also demonstrate how to install the open-source MADlib libraries into Greenplum. MADlib provides a rich set of libraries for advanced in-database data analysis and mining, which can be called via regular SQL. Installing Greenplum and MADlib will set the stage for some of the data science exercises I will be demonstrating in the near future.

Assumptions
– This part of the series assumes you are running CentOS 6.4 with an XFS filesystem. If you followed (Part 1 of 3) of this series then you are good to go.

Legend
– All Linux O.S. commands preceded by “#” implies “run as root”.
– All Linux O.S. commands preceded by “$” implies “run as gpadmin”.
– Be conscious that some commands wrap due to blog templating. An example of this is Step 40.  What appears to be two lines of text is actually one command (run as root) that is wrapped.

Let’s Begin

 

Step 1.  Download Greenplum Community Edition.  In Part 1 of this series we installed 64-bit CentOS, so you want to download “Pivotal Greenplum Database Red Hat Enterprise Linux 5 x86_64”.  This version supports 64-bit CentOS.  Using your SFTP or FTP client of choice, move the zip file to the /opt file system on the CentOS VM.

Step 2.  Before we can install Greenplum, we first have to make a few changes to the Linux XFS file system, specifically to the mount options:

Log in to the VM as root and make a copy of the /etc/fstab file:
Eg.) # cp /etc/fstab /etc/fstab.orig
Edit the /etc/fstab file for the “/opt” file system entry.  The options portion of the entry needs to be modified:
Replace “defaults 1 2” with “rw,noatime,inode64,allocsize=16m  0 0”
(Screenshots: the /etc/fstab entry before and after the change.)
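For reference, the /opt entry changes roughly as follows (the device name below is an assumption; keep whatever device your fstab already lists):

Before:  /dev/sda3   /opt   xfs   defaults   1 2
After:   /dev/sda3   /opt   xfs   rw,noatime,inode64,allocsize=16m   0 0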

Step 3.  Next we need to set the I/O disk scheduler policy.
Greenplum recommends the “deadline” policy, set via /sys/block/<devname>/queue/scheduler:
# echo deadline > /sys/block/sda/queue/scheduler
This will set the value for the running system.
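Note that this echo does not survive a reboot. A common way to persist it is to add the same echo command to /etc/rc.d/rc.local (alongside the blockdev command from the next step); adding elevator=deadline to the kernel line in /boot/grub/grub.conf is another option.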

Step 4.  Next we will set the “read-ahead” (blockdev) to 16385.

E.g.) # blockdev --setra 16385 /dev/sda
This sets the value.
# blockdev --getra /dev/sda
This indicates the value you just set.
To make this change permanent (survive reboot) you just need to add the complete “setra” command to the /etc/rc.d/rc.local file
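E.g.) add this line to /etc/rc.d/rc.local:
/sbin/blockdev --setra 16385 /dev/sda
(The /sbin/ prefix is an assumption about where blockdev lives on your system; adjust the path if needed.)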

Step 5.  Make a copy of the /etc/sysctl.conf file.
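E.g.) # cp /etc/sysctl.conf /etc/sysctl.conf.orig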
Open /etc/sysctl.conf for edit and set the following parameters:

kernel.shmmax = 500000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 512000 100 2048
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
net.ipv4.tcp_syncookies = 1
net.ipv4.ip_forward = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_max_syn_backlog=4096
net.ipv4.conf.all.arp_filter = 1
net.core.netdev_max_backlog=10000
kernel.msgmni=2048
net.ipv4.ip_local_port_range = 1025 65535
vm.overcommit_memory=2
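These settings load at boot; running # sysctl -p applies them immediately, but the reboot below is needed anyway to pick up the fstab change from Step 2.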

 Reboot the VM.

Step 6.  Make a copy of the file /etc/security/limits.conf
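E.g.) # cp /etc/security/limits.conf /etc/security/limits.conf.orig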
Open /etc/security/limits.conf file for edit and set the following parameters:

* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072

No reboot needed.

Step 7.  Install (if you haven’t already), configure, and start NTP.  This prevents Greenplum from complaining later on when we run the gpcheck utility.  This protocol is important when you have multiple nodes in the Greenplum cluster, since all nodes must have their clocks synchronized.

Install (Depending on how you did your install this may already be present)
# yum install ntp
Turn on service
# chkconfig ntpd on
Synchronize the system clock with the pool.ntp.org servers
# ntpdate pool.ntp.org
Start the NTP
# /etc/init.d/ntpd start
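To verify that ntpd is running and synchronizing with its peers (optional):
# ntpq -p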

Step 8.  Stage greenplum-db-4.2.2.4-build-1-CE-RHEL5-x86_64.zip on /opt

Step 9.  Unzip greenplum-db-4.2.2.4-build-1-CE-RHEL5-x86_64.zip

# unzip greenplum-db-4.2.2.4-build-1-CE-RHEL5-x86_64.zip

Step 10.  Launch the installer using bash.

# /bin/bash  ./greenplum-db-4.2.2.4-build-1-CE-RHEL5-x86_64.bin

Step 11.  Accept the license agreement. About 9 pages of click-throughs.

Step 12.  Don’t take the default install path.  Instead,  provide the following path:
/opt/greenplum-db-4.2.2.4

Step 13.  Confirm path for Greenplum database install.

Step 14.  Confirm install path creation.

Step 15.  If there is no prior install of Greenplum, press ENTER to skip this step.
The install is quick.  A symbolic link, /opt/greenplum-db, will be created in the directory one level up from the install directory you specified.

Step 16.  We now need to create an O.S. user to own the Greenplum installation:

# groupadd gpadmin
# useradd -g gpadmin gpadmin
# passwd gpadmin
(Enter and retype the new password at the prompts.)

Step 17.  Create data directory

# cd /opt/greenplum-db
# mkdir gpmaster
# cd gpmaster
# mkdir gpdata1

Step 18.  Make gpadmin the owner of the Greenplum install:

# cd /opt
#  chown -R gpadmin:gpadmin  /opt/greenplum-db-4.2.2.4

Step 19.  Add the following to /opt/greenplum-db/greenplum_path.sh:

MASTER_DATA_DIRECTORY=/opt/greenplum-db/gpmaster/gpdata-1
PGPORT=5432
export MASTER_DATA_DIRECTORY
export PGPORT

Note the hyphen in gpdata-1: this is the master instance directory that gpinitsystem creates in Step 28 (named from SEG_PREFIX plus “-1”). It is distinct from the gpdata1 directory created in Step 17, which is the parent directory for the segment data.

Step 20.  We are now ready to create the “gp_init_config” file:

# cp /opt/greenplum-db/docs/cli_help/gpconfigs/gpinitsystem_singlenode /home/gpadmin/gp_init_config
# chown gpadmin:gpadmin /home/gpadmin/gp_init_config
# chmod 644 /home/gpadmin/gp_init_config

Step 21.  Open /home/gpadmin/gp_init_config for edit and set the following:

ARRAY_NAME="Greenplum"
MACHINE_LIST_FILE=/home/gpadmin/multi_seg_hosts_file
SEG_PREFIX=gpdata
PORT_BASE=50000
declare -a DATA_DIRECTORY=(/opt/greenplum-db/gpmaster/gpdata1)
MASTER_HOSTNAME=Analytics1
MASTER_DIRECTORY=/opt/greenplum-db/gpmaster
MASTER_PORT=5432
TRUSTED_SHELL=ssh
CHECK_POINT_SEGMENT=8
ENCODING=UNICODE

Step 22.  Log in to the VM as “gpadmin”, or su to “gpadmin”.

Step 23.  Modify the .bashrc file for the user “gpadmin” to source:
/opt/greenplum-db/greenplum_path.sh
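E.g.) add this line to /home/gpadmin/.bashrc:
source /opt/greenplum-db/greenplum_path.sh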

Step 24.  Modify the .bash_profile for the user “gpadmin” to include:
/opt/greenplum-db/bin
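E.g.) add this line to /home/gpadmin/.bash_profile (a minimal sketch; adjust if your profile already manages PATH):
export PATH=/opt/greenplum-db/bin:$PATH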

Step 25.  Create the file /home/gpadmin/multi_seg_hosts_file and add the following hostname to it:
Analytics1

$ chown gpadmin:gpadmin /home/gpadmin/multi_seg_hosts_file

Step 26.  Modify /etc/hosts to include the hostname of your VM
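E.g.) an /etc/hosts entry such as:
192.168.107.100   Analytics1
(This IP is the one used later in Step 32; substitute your VM’s actual address.)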

Step 27.  Run the gpssh-exkeys utility to create and set ssh key for the host.

$ gpssh-exkeys -h Analytics1

Step 28.  Run the Greenplum configuration command:

$ gpinitsystem -c /home/gpadmin/gp_init_config

You will be prompted with “Continue with Greenplum creation”.
Type “y” then press ENTER.

Step 29.  Run the Greenplum gpcheck utility to verify that all is well.

$ cd /home/gpadmin
$ gpcheck -f ./multi_seg_hosts_file -m Analytics1

If errors are reported, address them and then re-run the gpcheck command.

Step 30.  To verify that the configuration was successful and that the Greenplum processes are running, issue the following command:

# ps -ef | grep gpadmin

You should see several Greenplum processes running.

Step 31.  Test that you can connect to the Greenplum database.

$ psql template1

You should see the following prompt: template1=#
If so, you are connected to the Greenplum database.
To exit type: template1=# \q    then [ENTER]

Step 32.  Now for some DBA tasks.

Create a database for future use.
$ psql template1 then [ENTER]
template1=#  create database sandbox;
template1=# \q then [ENTER]

Create a database user
$ psql -d postgres -h Analytics1 -p 5432 -U gpadmin
or (you can use the IP)

$ psql -d postgres -h 192.168.107.100 -p 5432 -U gpadmin
postgres=# CREATE ROLE datauser WITH LOGIN;
postgres=# ALTER ROLE datauser WITH PASSWORD 'datauser';
postgres=# \q

Step 33.  We now need to modify the pg_hba.conf file to allow this database user to connect to the database.

$ cd /opt/greenplum-db/gpmaster/gpdata-1
$ cp pg_hba.conf pg_hba.conf.orig
Open pg_hba.conf for edit
Add the following three lines:
local   all   datauser                    ident
host    all   datauser   127.0.0.1/28     trust
host    all   datauser   ::1/128          trust
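(The “local” line covers Unix-socket connections; the two “host” lines cover TCP connections from localhost over IPv4 and IPv6. If you will connect using the VM’s external IP, that address needs its own “host” line as well; see the first comment at the end of this post.)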

Step 34.  Load the changes.
$ pg_ctl -D /opt/greenplum-db/gpmaster/gpdata-1 reload

Step 35.  Test that you can connect to the database.

$ psql -d sandbox -h Analytics1 -p 5432 -U datauser
You should be prompted with: sandbox=>

Issue a command to see the databases:
sandbox=> \l
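Optionally, run a quick smoke test as datauser (a minimal sketch; the table name t1 is arbitrary):
sandbox=> create table t1 (id int) distributed by (id);
sandbox=> insert into t1 values (1);
sandbox=> select * from t1;
sandbox=> drop table t1;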

Exit psql utility:
sandbox=> \q

Step 36.  Now to install MADlib libraries, download MADlib 1.2 libraries here.  MADlib 1.2 documentation can be found here.

Step 37. Stage madlib-1.2-Linux.rpm on the VM. I used /opt.

Step 38.  Using the “yum” utility, install madlib-1.2-Linux.rpm.

# yum install madlib-1.2-Linux.rpm --nogpgcheck
Choose “y” at the “Is this ok” prompt.

Step 39.  Apply MADlib to Greenplum.

# source /opt/greenplum-db/greenplum_path.sh
# /usr/local/madlib/bin/madpack -p greenplum -c gpadmin@localhost:5432/sandbox install

Step 40.  Validate MADlib install.

# /usr/local/madlib/bin/madpack -p greenplum -c gpadmin@localhost:5432/sandbox install-check

This validation can take several minutes so be patient.
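Once install-check passes, you can also confirm that MADlib is callable from plain SQL (this assumes the default “madlib” schema that madpack creates):

$ psql -d sandbox
sandbox=# select madlib.version();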

Congratulations, you just installed Pivotal Greenplum with the MADlib libraries. In the final part of this series you will install R and configure it to interact with Greenplum:

(Part 3 of 3) – Installing and Configuring R with Pivotal Greenplum Connectivity.

Louis V. Frolio


Filed under Infrastructure, Tutorials

2 Responses to Building an Infrastructure to Support Data Science Projects (Part 2 of 3) – Installing Greenplum with MADlib

  1. Brandon Rogers

    Hi Louis,

    Thank you for creating this blog post and these detailed tutorials. Just a quick note I wanted to post on one item, in case anyone else runs into this in the future. When I got to this step:

    Step 35. Test that you can connect to the database.

    $ psql -d sandbox -h Analytics1 -p 5432 -U datauser

    I received the following error message:

    [gpadmin@Analytics1 gpdata-1]$ psql -d sandbox -h Analytics1 -p 5432 -U datauser
    psql: FATAL: no pg_hba.conf entry for host "192.168.198.100", user "datauser", database "sandbox", SSL off

    To resolve it, I added the following line to /opt/greenplum-db/gpmaster/gpdata-1/pg_hba.conf (192.168.198.100 is the IP address of my VM):

    host all datauser 192.168.198.100/32 trust

    My Greenplum database appears to be working now.

  2. Thanks for the feedback, Brandon.
