Data Warehouse ETL Offload with Hadoop


Data volumes are growing at an exponential rate, straining traditional IT infrastructures.  As a result, we are seeing more and more organizations take advantage of emerging technologies, like Hadoop, to help mitigate the pressure of exploding data volumes.  Hadoop and its ecosystem of tools play an important role in tackling tough problems that plague traditional IT and data warehouse environments.

Specifically, Extract, Transform, and Load (ETL) work can be offloaded to Hadoop to address the exploding data volumes that are breaking traditional IT and ETL processes.  Using screencasts and animated video, I will demonstrate how Hadoop can be used to offload the most taxing ETL workloads.

First, I will illustrate the overarching problem of ETL overload at a fictitious company called Acme Sales.  From there, actual demonstrations (screencasts) walk through an ETL offload using Hadoop, Hive, and Sqoop.

Before we begin, I would like to introduce Noelle Dattilo, the newest guest author on DataTechBlog.  Noelle is an education expert who specializes in animation technology.  Noelle gets full credit for creating all the animations you are about to see in this post.  I asked Noelle to explain her process for creating compelling animations.

“This series of animations was created with an online program called VideoScribe, a whiteboard animation tool.  The animations leverage problem-based learning to paint a conceptual picture in a form that is easily digestible to a wide audience.  To create the animations, I found some interesting graphics, turned them into SVG files (that’s the tricky part), uploaded them into VideoScribe, and placed them in the order to be drawn.  Once I uploaded the audio track that Louis recorded, I synced the timing of each animation with the track, and voilà, we have our Acme Sales animations.”


Handy Tip: After you click on a video and it starts to play, navigate to the bottom of the video window, click “Settings,” and choose “Quality = 1080p HD,” then maximize the window for the best viewing experience. This is especially helpful for the last two videos.

Video #1:  Visualizing the Problem

In this brief animated video, we present Acme Sales, its data overload, and how Hadoop and its ecosystem of tools can be used to mitigate pressure on the nightly ETL process.



Video #2:  Loading Data into Hadoop using Hive and Sqoop

Using Hive, nightly transactional data (pushed down from Acme’s 1,600 stores) is pulled from a staging area (file system) and loaded into HDFS. Sqoop is then used to extract master data from Acme’s enterprise data warehouse.
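As a rough sketch of this load step, the two commands below show the general shape of the Hive load and the Sqoop import. Everything here (table names, column layout, paths, and the JDBC URL) is an illustrative assumption, not taken from the actual demo; the script only assembles and prints the commands so you can inspect them:

```shell
#!/bin/sh
# Sketch only: all identifiers and connection details are illustrative.

# HiveQL: expose the staged flat files as an external table, then copy
# the rows into a managed table stored in HDFS.
hive_load='
CREATE EXTERNAL TABLE IF NOT EXISTS sales_staging (
  store_id INT, sku STRING, qty INT, sale_ts TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
LOCATION "/staging/sales";
CREATE TABLE IF NOT EXISTS sales_raw AS SELECT * FROM sales_staging;'

# Sqoop: pull master (dimension) data out of the EDW into Hive.
sqoop_import='sqoop import \
  --connect jdbc:mysql://edw.example.com/warehouse \
  --username etl_user -P \
  --table product_master \
  --hive-import --hive-table product_master'

# A real nightly job would execute these directly:
#   hive -e "$hive_load"
#   eval "$sqoop_import"
printf '%s\n\n%s\n' "hive -e '$hive_load'" "$sqoop_import"
```

The external table is just a schema mapped over the staged files in place; copying into a managed table is what lands the data in HDFS under Hive's control.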



Video #3:  Transforming Data with Hive

Using Hive, raw transactional data is combined with Acme’s master data in preparation for the enterprise data warehouse. The processed data is then pushed to Acme’s warehouse using Sqoop.
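As with the load step, here is a rough sketch of the transform-and-export stage (again, the table and column names are illustrative assumptions, not the demo's actual schema); the script assembles and prints the Hive join and the Sqoop export:

```shell
#!/bin/sh
# Sketch only: identifiers and connection details are illustrative.

# HiveQL: join the raw transactions with the master data to produce a
# warehouse-ready result set.
hive_transform='
CREATE TABLE IF NOT EXISTS sales_enriched AS
SELECT r.store_id, m.product_name, r.qty, r.sale_ts
FROM sales_raw r
JOIN product_master m ON r.sku = m.sku;'

# Sqoop: push the processed rows back into the EDW. Hive stores managed
# tables under /user/hive/warehouse by default, delimited with Ctrl-A
# (\001), hence the export-dir and field-terminator flags.
sqoop_export='sqoop export \
  --connect jdbc:mysql://edw.example.com/warehouse \
  --username etl_user -P \
  --table sales_fact \
  --export-dir /user/hive/warehouse/sales_enriched \
  --input-fields-terminated-by "\001"'

# A real job would execute:
#   hive -e "$hive_transform"
#   eval "$sqoop_export"
printf '%s\n\n%s\n' "hive -e '$hive_transform'" "$sqoop_export"
```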



Regards, Louis & Noelle.


Filed under Big Data Use Cases, Tutorials

2 Responses to Data Warehouse ETL Offload with Hadoop.

  1. Amit

    Hi Louis,
    I like the Acme case study you shared. I am new to the Hadoop world and trying to learn, though I have a good amount of experience in traditional BI. I have a few queries below; I would appreciate it if you could answer them:

    1. Which program are you using to load the 3100 sales files to HDFS, and how are you scheduling it on a daily basis?
    2. What is the total volume of data in the cluster, and what is the average daily growth?
    3. Does loading files into HDFS take less time than loading them into a database?
    4. How are you handling incremental changes in HDFS/Hive with these files?
    5. How big is the Hadoop cluster (how many nodes, and what commodity hardware configuration)?

  2. Amit, thank you for reading my blog and asking questions. Please see my comments below:

    1. Apache Sqoop was used for this tutorial but many other tools exist. Scheduling can occur through a variety of mechanisms including Cron, Jenkins, etc.
    2. This tutorial used a small dataset to demonstrate the process. However, the process can scale to petabytes given the right process and technology.
    3. Loading data into HDFS is much faster because HDFS does not care what data is placed into it; this is what we call a “schema on read” approach. Take a moment to research schema on read versus schema on write.
    4. Remember that HDFS/Hive is a great way to offload EDW ETL, and in some use cases it can act as an EDW itself. However, traditional EDWs still play a key role in business; I don’t expect to see them going away anytime soon.
    5. The size of a Hadoop cluster is directly correlated to the workload placed upon it. A cluster might have only 5 nodes, or it could grow to 100 or more. Let me know if you have further questions.
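    To illustrate the cron option in answer 1 (the script path and name here are hypothetical), a nightly schedule could be a crontab entry like this:

```shell
# Hypothetical nightly schedule: at 01:30 every day, run a wrapper
# script that issues the Hive load and Sqoop import shown in the videos.
30 1 * * * /opt/acme/etl/nightly_hdfs_load.sh >> /var/log/acme_etl.log 2>&1
```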

    Regards, Louis.
