What the hell is a Data Lake?

This article is supposed to be about Data Lakes, what they are and why forward-thinking businesses need one, but let me start with an apology.

As a result of numerous personal flaws that we do not have time to address here, I feel compelled to include several water-based puns. Look, I'm sorry, that is just the kind of person I am, but I promise I'll keep them to a minimum. OK, with that out of the way, let's dive in... shit.

So why should you risk wasting your time learning about Data Lakes? It's an excellent question, and the answer is tied to both what a Data Lake is and, consequently, why it has the potential to be so useful.

Understanding Data Lakes

Taking risks is at the core of all successful businesses, but selecting which to take is a huge challenge. Take Kodak's failure to react to the rise of digital photography. Despite indications to the contrary, they predicted a future where traditional film and digital photography could coexist in harmony. How wrong they were.

Yet, except for that friend of your mum's who owns an incense and crystal shop, there is broad agreement that no one can truly predict the future with certainty. That said, we are only here living, breathing and trying to understand Data Lakes because we come from a long line of successful predictors.

Now, your ancestors might not have been predicting with certainty, but they must have done better than the chaps next door who took a risk on the delicious-looking deadly nightshade berries, right? And in business, being better than the chaps next door is often enough to remain off the endangered list.

So how can we go from a finger-in-the-air style of prediction to something with a bit more certainty? The answer is data. Enough data to fill a... you can see where I'm going with this.

Connecting Your Data

The aim of a Data Lake is relatively simple. Gather all your data together in one place and analyse it. If all your data is in one place, you'll be able to see patterns and connections that would have otherwise remained hidden.

However, as with all simple aims, there are a few more challenges to overcome before a business is swimming (sorry) in the success that increasingly accurate predictions bring.

Each specialised department within a business has a distinct way of storing and accessing the data that is valuable to it. Each of these systems does an excellent job for its particular department, but none was designed to connect to the others and allow cross-pollination of data.

This lack of connection is a massive problem. Data Lakes can only realise their potential when all the data is speaking the same language. So if collecting it wasn't challenging enough, step two is to find a common language. Only then can you find the patterns and connections that lead to efficiencies or, to put things bluntly, more money. And just like knowledge, who doesn't want more of that?
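For the curious, here is a rough sketch of what "finding a common language" might look like in Python. Everything in it is an assumption made up for illustration: the two departments, their field names, the date formats and the shared schema are all hypothetical, and a real Data Lake would use far heavier tooling than this.

```python
from datetime import date, datetime

# Hypothetical exports from two departments. Field names and formats are invented.
SALES_ROWS = [{"CustomerName": "Ada Lovelace", "SaleDate": "03/02/2021", "Total": "120.50"}]
SUPPORT_ROWS = [{"customer": "ada lovelace ", "opened": "2021-02-05", "ticket_id": 7}]


def from_sales(row: dict) -> dict:
    """Map a sales row onto the shared schema (dates arrive as DD/MM/YYYY)."""
    return {
        "customer": row["CustomerName"].strip().lower(),
        "date": datetime.strptime(row["SaleDate"], "%d/%m/%Y").date(),
        "source": "sales",
        "amount": float(row["Total"]),
    }


def from_support(row: dict) -> dict:
    """Map a support row onto the shared schema (dates arrive as ISO 8601)."""
    return {
        "customer": row["customer"].strip().lower(),
        "date": date.fromisoformat(row["opened"]),
        "source": "support",
        "amount": None,  # support tickets carry no monetary value
    }


def build_lake(sales: list, support: list) -> list:
    """Translate both sources, then order by customer and date."""
    records = [from_sales(r) for r in sales] + [from_support(r) for r in support]
    return sorted(records, key=lambda r: (r["customer"], r["date"]))


lake = build_lake(SALES_ROWS, SUPPORT_ROWS)
for record in lake:
    print(record)
```

Once both sources share the same field names and formats, the sale and the support ticket sit side by side under one customer, which is exactly the kind of connection that was invisible while they lived in separate systems.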

Right then, your Data Lake is full, and you have overcome the challenges of conflicting languages, what next?

The next challenge is wading (sorry again) through such an unorganised sea (I'll stop soon) of raw information. It's highly likely you are drowning (oops) in multiple copies of the same data. So, how do you know what to consolidate, and what to kill?

What's Next

I'm not going to bore you with the technical side of the organisation and consolidation process. Suffice it to say that some 'computer magic' is used to automate the most statistically apparent matches. What remains (the soft-matches) must be resolved by hand. These soft-matches can number in the tens of thousands, but I'm reliably informed that computer scientists love that kind of thing.
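If you do fancy a small peek behind the curtain, the split between automatic matches and soft-matches can be sketched like this. It is a toy under stated assumptions, not how any particular product does it: the company names are invented, the 0.75 threshold is an arbitrary choice, and the similarity score comes from Python's standard difflib module rather than a proper record-linkage tool.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name: str) -> str:
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def classify_pairs(names, soft_threshold=0.75):
    """Split candidate pairs into exact matches and soft-matches.

    Exact matches (identical after normalising) can be merged automatically.
    Soft-matches (similar but not identical) are queued for a human to review.
    The threshold is a hypothetical value chosen for this example.
    """
    exact, soft = [], []
    for a, b in combinations(names, 2):
        na, nb = normalise(a), normalise(b)
        if na == nb:
            exact.append((a, b))
        elif SequenceMatcher(None, na, nb).ratio() >= soft_threshold:
            soft.append((a, b))
    return exact, soft


# Invented sample records.
names = ["Acme Ltd", "acme  ltd", "Acme Limited", "Olive Supplies Inc"]
exact, soft = classify_pairs(names)
print("merge automatically:", exact)
print("review by hand:", soft)
```

Here the two spellings of "Acme Ltd" are merged without anyone lifting a finger, while "Acme Limited" is close enough to raise suspicion but not close enough to trust, so it lands in the review pile.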

But what are the tangible rewards for those who have invested in this kind of technology? A strong example of how this kind of joined-up thinking can impact business is that of American Airlines. Connecting their systems allowed them to identify an item that was carried on all flights but most commonly thrown away: the humble olive. By removing this, the carrier was able to cut $40K from annual fuel costs overnight.
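To make the olive hunt a little more concrete, here is a minimal sketch of the kind of question connected data lets you ask. The numbers and item names are entirely made up for illustration and have nothing to do with any real airline; the assumption is simply that one system records what was loaded and another records what was thrown away.

```python
from collections import defaultdict

# Invented figures for illustration only. Not real airline data.
LOADED = [("olive", 100), ("bread roll", 100), ("butter", 100)]   # catering system
DISCARDED = [("olive", 72), ("bread roll", 15), ("butter", 30)]   # waste audit


def discard_rates(loaded, discarded):
    """Join the two sources on item name and rank items by share thrown away."""
    loaded_totals = defaultdict(int)
    for item, count in loaded:
        loaded_totals[item] += count

    discarded_totals = defaultdict(int)
    for item, count in discarded:
        discarded_totals[item] += count

    rates = {
        item: discarded_totals[item] / total
        for item, total in loaded_totals.items()
        if total  # skip items that were never loaded
    }
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


ranking = discard_rates(LOADED, DISCARDED)
for item, rate in ranking:
    print(f"{item}: {rate:.0%} discarded")
```

Neither list says much on its own. Only once they are joined does the worst offender float to the top.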

Consolidated data means you can make connections and uncover previously hidden patterns. But be warned, data is indiscriminate. It shines a light into every corner, and onto every olive. This level of visibility might be a scary prospect if you have been hiding in the corner, with the olives.

Understanding the importance of this kind of technology allows businesses to unlock their full potential by exploring connections and possible solutions that would otherwise remain invisible. And, as I'm sure you'll agree, it is always better to know more than the berry-eaters down the road.