Modern Data Architecture: The Data Lake

Today, one of the hottest topics out there is “big data.” It seems that everybody is talking about it, and more and more companies are throwing their hats into the big data ring. These are exciting times because there is a fundamental shift in how we think about data. Not that long ago, structured data reigned supreme. For data architects, the methods of handling data (transactional and dimensional) were grounded in sound theory thanks to E.F. Codd (Relational Modeling), Bill Inmon (Top Down 3NF Design), Ralph Kimball (Dimensional Modeling) and Daniel Linstedt (Data Vault Architecture). We are now living in a post-relational world where the majority of the data being generated (estimates put it at 80%) is semi-structured, quasi-structured or unstructured [1]. Further, this data is growing at a rapid rate. As of 2012, digital content was being created at a rate of 2.5 quintillion bytes each day (a quintillion is a 1 followed by 18 zeros) [2]. Moreover, between 2009 and 2020 we can expect to see a 44-fold increase in all digital content, and only 5% of this data will be classified as structured [3]. So, with all those impressive stats, the question staring us in the face is this: “How do we manage this deluge of unstructured data, and how do we get it to play nice with structured data?” Enter the Data Lake!


If we consider a traditional enterprise data architecture, it might resemble the following:

[Figure: traditional enterprise data architecture]

Here you can see: 1. Operational Data Sources, 2. Enterprise Data Integration Hub, 3. Enterprise Data Warehouse (EDW), 4. Presentation Layer. The operational data sources in a traditional enterprise are primarily structured data. These could be RDBMSs, Excel spreadsheets, well-formed flat files, etc. The data sources are fed (via ETL/ELT) into the Enterprise Data Integration Hub, which is where data cleansing, data transformation, and data munging happen. The processed data are then pushed to the Enterprise Data Warehouse. The data will end up at rest (within the enterprise warehouse) in one of several repositories, which may include: 1. Operational Data Store (ODS), 2. Data Mart (DM), 3. Data Warehouse (DW), etc. Finally, the presentation layer accesses the EDW to feed reports, visualizations, dashboards, etc.
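The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the sample rows, the `fact_orders` table and the function names are all made up for the example, and an in-memory SQLite database stands in for the enterprise warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from an operational source
# (for example a flat-file export). All values are strings at this point.
raw_orders = [
    {"order_id": "1001", "customer": " Alice ", "amount": "250.00"},
    {"order_id": "1002", "customer": "BOB", "amount": "99.5"},
    {"order_id": "1003", "customer": "", "amount": "n/a"},  # dirty row
]


def cleanse(rows):
    """Integration-hub step: normalize names, cast types, reject bad rows."""
    clean = []
    for row in rows:
        name = row["customer"].strip().title()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # amount cannot be parsed, so the row is rejected
        if not name:
            continue  # a row without a customer name is rejected
        clean.append((int(row["order_id"]), name, amount))
    return clean


def load(conn, rows):
    """Warehouse step: store the processed rows in a fixed-schema table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def report(conn):
    """Presentation step: a simple aggregate a dashboard might show."""
    query = (
        "SELECT customer, SUM(amount) FROM fact_orders "
        "GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()


warehouse = sqlite3.connect(":memory:")  # stand-in for the EDW
load(warehouse, cleanse(raw_orders))
sales_by_customer = report(warehouse)
print(sales_by_customer)
```

Note that the schema is fixed before any data is loaded, which is exactly why this style of pipeline works well for structured sources and poorly for everything else.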

However, as the world moves toward the Third Platform, we are seeing more and more unstructured and machine-generated data being produced. Further, with GPS and free public Wi-Fi, more and more Data Exhaust is being generated and corralled by businesses. With this volume of data comes the need to integrate it into the EDW, because its full value is realized only when it can be correlated with other pertinent data.

These new sources of data are putting pressure on traditional enterprise architectures. Data Exhaust, clickstream, sensors and the like generally produce data that falls into one of three buckets: 1. Semi-structured, 2. Quasi-structured, 3. Unstructured. Traditional data architecture does not (naturally) play nicely with these types of data. As such, new technologies must be employed to facilitate cross-pollination of these disparate sources.
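To make the mismatch concrete, here is a small Python sketch of what correlating semi-structured clickstream events with a structured customer table might look like. Everything in it is an assumption made for illustration: the event fields, the `customers` lookup and the function names are hypothetical, and a real deployment would run this kind of logic on a platform built for the volume rather than in a single script.

```python
import json
from collections import Counter

# Structured data: every record has the same, known fields.
customers = {
    1: {"name": "Alice", "segment": "enterprise"},
    2: {"name": "Bob", "segment": "consumer"},
}

# Semi-structured data: one JSON document per line, and the fields
# present vary from event to event. Some lines are not usable at all.
raw_events = [
    '{"customer_id": 1, "page": "/pricing", "device": "mobile"}',
    '{"customer_id": 2, "page": "/home"}',
    '{"customer_id": 1, "page": "/docs", "referrer": "search"}',
    'not valid json',
    '{"page": "/home"}',
]


def parse_events(lines):
    """Apply structure at read time, skipping lines that cannot be parsed."""
    events = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict):
            events.append(event)
    return events


def page_views_by_segment(events, customer_table):
    """Correlate events with the structured table via customer_id."""
    counts = Counter()
    for event in events:
        customer = customer_table.get(event.get("customer_id"))
        segment = customer["segment"] if customer else "unknown"
        counts[segment] += 1
    return dict(counts)


views = page_views_by_segment(parse_events(raw_events), customers)
print(views)
```

The events are never forced into a fixed table up front. Structure is applied only when the data is read, and events that cannot be matched to a customer are still counted rather than rejected.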

[Figure: Data Lake]

Hadoop, when introduced into the enterprise, enables analytics across all data.  This collaboration (or swimming together) of data facilitates data exploration that was not easily achievable before.  The resulting data lake empowers organizations to fully utilize all of their data assets.  Steve Todd (EMC Fellow) does a brilliant job of outlining how Pivotal is championing the modern data architecture landscape in his blog “Information Playground.”

Leveraging Hadoop should not be seen as an insurmountable obstacle. With tools such as RHIPE, RHadoop, Hive, Pig, Sqoop, R, and others, analyzing and correlating all the data in an EDW is much easier.

References

[1] EMC Education Services (April 2013). “Data Science and Big Data Analytics” EMC Corporation. Accessed April 2013, from URL.
[2] Roe, C. (2012). “The Growth of Unstructured Data: What To Do with All Those Zettabytes?”. Accessed 2014 via URL.
[3] Nielsen, L., Burlingame, N. (2012). “A simple introduction to data science”. Accessed 2014.

 


One Response to Modern Data Architecture: The Data Lake

  1. Vignesh

    Well written and to the point article.
