Data Lake Practical Techniques: Part 1 – Introduction

“Data Lake” is one of the current hot topics in Big Data, with lots of press and little practical content. A big part of the confusion and controversy arises from the tension between hype and common sense. Proponents promise phenomenal insights and actionable predictions drawn from customer behavior, employee behavior, doctors’ handwritten diagnostic notes and so on. Basically, they claim you can dump documents and data into the lake and let the algorithms make sense of it all.

Of course, that’s patent nonsense. Unfortunately, the hype obscures some real value in the data lake concept. In this post and the ones that follow, we’ll look at the practical value of data lakes and some simple techniques to make them usable and useful.

In other words, I’m not here to debate usefulness or use cases. I’m here to offer some insight, cautionary notes and concrete, real-world techniques to help you succeed. These practical techniques are founded on my belief that data lakes are fundamentally no different from the data warehouses I have designed and built in the past.

For example, one of the novelties of data lakes is that they can be used to house unstructured data. But all data has structure – for text or unstructured data, it’s just a matter of teasing out that structure. Before any text mining can be performed, the first step is to convert the text into structured objects which can be analyzed.
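To make that first step concrete, here is a minimal sketch of turning free text into structured objects. It is only an illustration, not a recommended text-mining pipeline: the `Token` record, the `structure_text` function and the sample note are all hypothetical names and data invented for this example, and the sentence and word splitting is deliberately naive.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One word of the source text plus where it was found (hypothetical record)."""
    text: str      # lowercased word
    sentence: int  # index of the sentence the word came from
    position: int  # index of the word within that sentence


def structure_text(raw: str) -> List[Token]:
    """Split raw text into sentences and words, returning one record per word."""
    # Naive sentence split on ., ! or ?; real text needs something more robust.
    sentences = [s for s in re.split(r"[.!?]+\s*", raw) if s]
    tokens = []
    for s_idx, sentence in enumerate(sentences):
        for w_idx, word in enumerate(re.findall(r"[A-Za-z']+", sentence)):
            tokens.append(Token(word.lower(), s_idx, w_idx))
    return tokens


# Invented sample text standing in for a free-text note.
note = "Patient reports chest pain. Pain worsens at night."
tokens = structure_text(note)

# Once the text is a list of records, ordinary analysis applies.
counts = Counter(t.text for t in tokens)
```

Each `Token` is now a row with columns, which is exactly the kind of object a warehouse-style query or an algorithm can work with.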

And for any data to be usable, structured or unstructured, you must have the associated metadata.

In my next post, I’ll explain in detail why there is no such thing as unstructured data and how that realization will enable you to get value out of a data lake.

Part 2 – Metadata (coming soon)

  1. Looking forward to the next group of posts on this topic, as I have a requirement to do a POC in order to shift from our traditional data warehouse to a data lake.

