Data volumes are growing at an exponential rate, causing problems for traditional IT infrastructures. As a result, more and more organizations are turning to emerging technologies like Hadoop to relieve the pressure of exploding data volumes. Hadoop and its ecosystem of tools play an important role in tackling the tough problems that plague traditional IT and data warehouse environments.
Specifically, Extract, Transform, and Load (ETL) workloads can be offloaded to Hadoop when exploding data volumes are breaking traditional ETL processes. Using screencasts and animated video, I will demonstrate how Hadoop can be used to offload the most taxing ETL workloads.
First, I will illustrate the overarching problem of ETL overload at a fictitious company called Acme Sales. From there, I move on to actual demonstrations (screencasts) of an ETL offload using Hadoop, Hive, and Sqoop.
Before we begin, I would like to introduce Noelle Dattilo, the newest guest author on DataTechBlog. Noelle is an education expert who specializes in animation technology. Noelle gets full credit for creating all the animations you are about to see in this post. I asked Noelle to explain her process for creating compelling animations.
“This series of animations was created with an online program called VideoScribe, a whiteboard animation tool. The animations leverage problem-based learning to paint a conceptual picture in a form that is easily digestible to a wide audience. To create the animations, I found some interesting graphics, turned them into SVG files (that’s the tricky part), uploaded them into VideoScribe, and placed them in the order to be drawn. Once I uploaded the audio track that Louis recorded, I synced the timing of each drawing with the track, and voilà, we have our Acme Sales animations.”
Handy Tip: After you click on a video and it starts to play, navigate to the bottom of the video window and click on “settings.” Choose “Quality = 1080p HD”, then maximize the window for the best viewing experience. This is especially helpful for the last two videos.
Video #1: Visualizing the Problem
In this brief animated video, we present Acme Sales, its data overload, and how Hadoop and its ecosystem of tools can be used to relieve pressure on the nightly ETL process.
Video #2: Loading Data into Hadoop using Hive and Sqoop
Using Hive, nightly transactional data (pushed down from Acme’s 1600 stores) is pulled from a staging area (file system) and then loaded into HDFS. Sqoop is then used to extract master data from Acme’s enterprise data warehouse.
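For readers who prefer text to video, here is a minimal sketch of what the load step might look like. It only builds the commands rather than running them, and it is not a transcript of the screencast. The staging path, table names, JDBC URL, and user name are all hypothetical placeholders. The Hive `LOAD DATA LOCAL INPATH` statement and the Sqoop `import` options shown are standard, but the exact options used in the video may differ.

```python
import shlex


def hive_load_statement(staging_path: str, table: str) -> str:
    """Build a HiveQL statement that loads a local staging file into a Hive table.

    LOAD DATA LOCAL INPATH copies the file from the local file system
    into the table's storage directory on HDFS.
    """
    return f"LOAD DATA LOCAL INPATH '{staging_path}' INTO TABLE {table}"


def sqoop_import_command(jdbc_url: str, username: str,
                         source_table: str, target_dir: str) -> list[str]:
    """Build a Sqoop command that pulls one warehouse table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,        # JDBC URL of the data warehouse
        "--username", username,
        "-P",                         # prompt for the password at run time
        "--table", source_table,      # master-data table to extract
        "--target-dir", target_dir,   # HDFS directory to write into
    ]


if __name__ == "__main__":
    # All values below are made-up placeholders for illustration.
    print(hive_load_statement("/staging/acme/transactions.csv", "sales_raw"))
    print(shlex.join(sqoop_import_command(
        "jdbc:mysql://edw.example.com/acme", "etl_user",
        "product_master", "/data/acme/product_master")))
```

The Hive statement would be run from the Hive shell or with `hive -e`, and the Sqoop command from a node with the Sqoop client installed.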
Video #3: Transforming Data with Hive
Using Hive, raw transactional data is processed with Acme’s master data in preparation for the enterprise data warehouse. The processed data is then pushed to Acme’s warehouse using Sqoop.
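The transform and push-back steps can be sketched the same way. This is an illustration under stated assumptions, not the exact job from the screencast: the table names, the `product_id` join key, the `product_name` column, and the connection details are hypothetical. It assumes the processed Hive table uses Hive's default field delimiter (Ctrl-A, written `\001`), which is why the Sqoop export passes `--input-fields-terminated-by`.

```python
def hive_transform_query(raw_table: str, master_table: str,
                         output_table: str) -> str:
    """Build a HiveQL query that enriches raw transactions with master data.

    The product_id join key and product_name column are hypothetical.
    """
    return (
        f"INSERT OVERWRITE TABLE {output_table} "
        f"SELECT t.*, m.product_name "
        f"FROM {raw_table} t "
        f"JOIN {master_table} m ON t.product_id = m.product_id"
    )


def sqoop_export_command(jdbc_url: str, username: str,
                         warehouse_table: str, export_dir: str) -> list[str]:
    """Build a Sqoop command that pushes processed HDFS files to the warehouse."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                           # prompt for the password at run time
        "--table", warehouse_table,     # destination table in the warehouse
        "--export-dir", export_dir,     # HDFS directory holding processed data
        # Assumes the Hive table was written with Hive's default delimiter.
        "--input-fields-terminated-by", r"\001",
    ]


if __name__ == "__main__":
    # All values below are made-up placeholders for illustration.
    print(hive_transform_query("sales_raw", "product_master", "sales_processed"))
    print(" ".join(sqoop_export_command(
        "jdbc:mysql://edw.example.com/acme", "etl_user",
        "sales_fact", "/user/hive/warehouse/sales_processed")))
```

Keeping the heavy join inside Hive is the point of the offload: the warehouse only receives the finished rows.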
Regards, Louis & Noelle.