In part 1 of this series, I demonstrated how to install, configure, and run a three-node Hadoop cluster. In part 2, you were shown how to take the default “word count” YARN job that comes with Hadoop 2.2.0 and improve it. In this leg of the journey, I will demonstrate how to install and run Hive. Hive is a tool that sits atop Hadoop and facilitates the creation of YARN (next-generation MapReduce) jobs without having to write Java code. With Hive and its scripting language, HiveQL, querying data across HDFS becomes simple. HiveQL is a SQL-like scripting language that gives those with SQL knowledge immediate access to data in HDFS. HiveQL also lets you reference custom MapReduce scripts right in your queries.
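To give a flavor of what HiveQL looks like, here is a minimal sketch of the classic word count expressed as a query rather than Java code. The table name `docs` and the HDFS path are assumptions for illustration, not part of this tutorial's setup:

```sql
-- Hypothetical example: table name and HDFS location are assumptions.
-- Map each line of text in an HDFS directory to a one-column table.
CREATE EXTERNAL TABLE docs (line STRING)
LOCATION '/user/hadoop/input';

-- Word count in HiveQL: split each line on whitespace, explode the
-- resulting array into one row per word, then group and count.
SELECT word, COUNT(1) AS count
FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word
ORDER BY count DESC;
```

Behind the scenes, Hive compiles this query into MapReduce jobs that run on the cluster, which is exactly the appeal: SQL skills translate directly into jobs over HDFS data.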
Today, one of the hottest topics out there is “big data.” It seems that everybody is talking about it, and more and more companies are throwing their hat into the big data ring. These are exciting times because there is a fundamental shift in how we think about data. Not that long ago, structured data reigned supreme. For data architects, the methods of handling data (transactional and dimensional) were based in sound theory thanks to E.F. Codd (Relational Modeling), Bill Inmon (Top-Down 3NF Design), Ralph Kimball (Dimensional Modeling), and Daniel Linstedt (Data Vault Architecture). We are now living in a post-relational world where the majority of the data being generated (estimates have it at 80%) is semi-structured, quasi-structured, or unstructured (1). Further, this data is growing at a rapid rate. As of 2012, digital content is being created at a rate of 2.5 quintillion (1 with 18 trailing zeros) bytes of data each day! (2) Moreover, between 2009 and 2020 we can expect to see a 44-fold increase in all digital content. Of this data, only 5% will be classified as structured (3). So, with all those impressive stats, the question staring us in the face is this: “How do we manage this deluge of unstructured data, and how do we get it to play nice with structured data?” Enter the Data Lake!
Welcome to DataTechBlog. My name is Louis and I am a data professional. I embrace all data: big, small, structured, semi-structured, unstructured, dark, sensor; I do not discriminate. Over the past 20 years I have gained expertise in many aspects of data, including analytics, management, operations, architecture, technology, administration, and engineering.
Over the past several years, the terms “data science” and “big data” have become commonplace. My goal is to help other data and database professionals learn about the emerging disciplines of data science and big data analytics. Here you will find tutorials, how-tos, and topic discussions on various dimensions of these disciplines, including data mining, exploratory data analysis, data prep/scrubbing, data engineering, tools (e.g. Greenplum, R, MADlib, Hadoop, Hive, Pig, etc.), visualizations, and much more.
Coming from a traditional data architecture background, I can help bridge the gap for RDBMS professionals who are interested in learning more about data science and big data analytics.