Ingo Steins (51) is Unbelievable Machine's Deputy Director of Operations, heading up the applications division from our base in Berlin. He joined us two years ago as a Development Team Leader, with several years of experience in software and data development and in managing large teams, and is now at the helm of three such teams distributed across our sites. Ingo is an expert on the topic of data lakes, so we asked him to tell us more about this fascinating technology.
Ingo, some people say that data lakes are just the latest incarnation of data warehouses – "old wine in a new lake", so to speak. The systems do seem quite similar. Is this perception true?
I like the image of a lake full of wine (laughs) – but unfortunately that's a misconception. Data lakes and data warehouses actually only have one thing in common, and that is the fact that they are both designed to store data. Apart from that, the systems have fundamentally different applications and offer different options to users.
Could you briefly explain the differences? What exactly is a data lake?
A data lake is a storage system or repository that gathers together enormous volumes of unstructured raw data. Like a lake, the system is fed by many different sources and data flows...
...so that's where the metaphor comes from?
Exactly (laughs). That, and its huge capacity. Data lakes allow you to store vast quantities of highly diverse data and use it for big data analytics.
And what about data warehouses?
A data warehouse is a central repository for company management, so it's quite different. Its primary role is as a component of business intelligence; it stores figures for use in process optimization planning, or for determining the strategic direction of the company. And it supports business reporting, so the data it contains must all be structured and in the same format.
So that means that data warehouses aren't actually designed for large-scale data analysis?
That's right. These systems will reach their structural and capacity limits very quickly when used in this way.
Where exactly are these limits?
I believe there are four key areas where these limits become evident. The first is the enormous volume of data we now generate. Take e-commerce, for example: each purchase process leaves a clickstream of data that allows the owner to obtain information on buying habits and how to optimize the sales process. This data is unstructured and needs to be processed quickly. There's also streaming, which generates continuous data in real time.
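To make the clickstream idea concrete, here is a minimal sketch of the kind of processing Ingo describes. The event format and field names are hypothetical, invented purely for illustration: semi-structured JSON events are stored raw, one per line, and only parsed when a question is asked of them.

```python
import json
from collections import Counter

# Hypothetical raw clickstream events as they might land in a data lake:
# semi-structured JSON lines with no schema enforced on write.
raw_events = """
{"user": "u1", "action": "view", "product": "pump"}
{"user": "u1", "action": "add_to_cart", "product": "pump"}
{"user": "u2", "action": "view", "product": "drill"}
{"user": "u1", "action": "purchase", "product": "pump"}
""".strip().splitlines()

def action_counts(lines):
    """Parse each event on read and count actions per product."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        counts[(event["product"], event["action"])] += 1
    return counts

counts = action_counts(raw_events)
print(counts[("pump", "purchase")])
```

The point of the sketch is that no fixed table schema is needed up front; the structure is imposed only at analysis time, which is exactly where a rigid warehouse schema becomes a bottleneck.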
And the second?
The second limitation is the fact that high-quality analyses now draw on a variety of different data sources, including social media, weblogs, sensors, and mobile technology. And of course, all this data is delivered in completely different formats.
What might such an analysis look like in practice?
To give you an example: if weather forecasting data predicts a storm and heavy rain, it would be a good idea for the owner of a hardware store to stock up on pumps, because demand is highly likely to rise in the very near future.
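Ingo's hardware-store scenario can be sketched as a toy rule that maps a forecast to a stocking recommendation. The forecast fields and thresholds here are invented for illustration, not taken from any real weather API:

```python
# Toy illustration: derive stocking advice for a hardware store from a
# (hypothetical) weather forecast record.
def stocking_advice(forecast):
    advice = []
    # Storms or heavy rain raise the risk of flooded cellars,
    # which in turn drives demand for pumps.
    if forecast.get("storm") or forecast.get("rain_mm", 0) > 30:
        advice.append("pumps")
    # Sub-zero temperatures suggest demand for road salt.
    if forecast.get("temp_c", 20) < 0:
        advice.append("road salt")
    return advice

print(stocking_advice({"storm": True, "rain_mm": 45}))
```

In a real pipeline, the forecast record would come from an external weather feed in one format and the sales history from another, which is why combining them needs infrastructure that tolerates mixed formats.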
And data warehouses are not suitable for this kind of data analysis?
Exactly. This kind of application relies on data in several different formats, and these formats cannot be combined using data warehouse technology.
What is the third limitation of the data warehouse?
The fact that the technology is very expensive. Large providers such as SAP, Microsoft, and Oracle offer various data warehouse models. To use these models, you generally need relatively new hardware and people with the expertise to manage the systems. Both of these things come at a cost.
The cost issue is compounded by the fact that data volumes grow exponentially over time. A data warehouse must expand accordingly to keep processing the larger quantities of data, and scaling the system up to meet those needs is very expensive.
And what's the fourth area?
The fourth point to consider is the fact that many data warehouses suffer from performance weaknesses. Their loading processes are complex and take hours, the implementation of changes is a slow and laborious process, and there are several steps to go through before you can generate even a simple analysis or report. To respond quickly, ideally in real time, you need to be able to access data significantly faster.
Do data lakes have limitations, too?
Hardly any. Data lakes are virtually limitless. They aren't products in the same way that data warehouses are, but more of a concept that is put together individually and can be expanded infinitely.
Some aspects of the data warehouse process are reversed in a data lake. For example, a data lake collects all conceivable and available data in one location, regardless of its relevance, structure, or usefulness, which results in an incredibly unstructured pool of information.
What do you need to start a data lake?
All you really need is a suitable database, which is relatively easy to set up with a solution like Hadoop.
Why is it so easy, and how does the process work?
In most cases, a data lake is based on a Hadoop cluster, which works like a large partitioned hard drive. Data lakes can store an almost unlimited range of data formats in very high volumes for indefinite periods of time. And because they are built using standard software on commodity hardware, the storage is comparatively cost-effective too.
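The "store everything raw, decide the schema later" principle Ingo describes is often called schema-on-read. Here is a minimal sketch of that idea, with an invented in-memory stand-in for the lake; in a real deployment the raw records would be files in HDFS or object storage rather than Python strings.

```python
import csv
import io
import json

# Toy "data lake": heterogeneous raw records stored exactly as they
# arrived, each tagged only with its source format (schema-on-read).
lake = [
    ("json", '{"sensor": "s1", "temp": 21.5}'),
    ("csv",  "sensor,temp\ns2,19.0"),
]

def read_temps(lake):
    """Apply a schema only at read time, per source format."""
    temps = {}
    for fmt, raw in lake:
        if fmt == "json":
            rec = json.loads(raw)
            temps[rec["sensor"]] = rec["temp"]
        elif fmt == "csv":
            for row in csv.DictReader(io.StringIO(raw)):
                temps[row["sensor"]] = float(row["temp"])
    return temps

print(read_temps(lake))
```

Contrast this with a warehouse, where both sources would have to be transformed into one common table structure before loading; here the transformation happens per query, so new formats can be added without reworking what is already stored.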
So data lakes clearly win the day?
Definitely. Data lakes can store huge volumes of data, but need no complex formatting or maintenance. The system doesn't impose any limits on processes or processing speeds – in fact, it actually opens up new ways to exploit the data you have, and can therefore help companies more generally in the process of digitalization.
Could you give an example to explain that idea?
Of course. One of our customers, a large transport and mobility company, wants to find out about its passenger flows, including journey times and train times. To do this, it is using mobile communications data purchased from a network provider, data it would not normally have access to. The customer is combining data from different contexts and sources, and the data lake serves as the shared database and the basis for this cross-analysis.
2018 is the year of the lake.
Absolutely. To keep pace with the flow of digitalization and be equipped for the future, companies should be using data lakes. And if they haven't got one already, they should at least be looking at the possibility. The system is key to modern production. It is a vast and cost-effective data storage method, as well as a fast and flexible data management platform.
Companies who want to access a wide range of data and process it effectively in real time to answer highly specialized and complex questions will find that the data lake is the perfect infrastructure to realize this goal...
...and you're the perfect person to contact.
Thank you for sharing your fascinating insights.
Ingo Steins, data lake expert
This may interest you, too:
Data lake: The bedrock of Big Data processing
Virtual data warehousing: Even more efficient data processing
Nerd Stuff: REST in Peace – Accident-free streaming thanks to Hystrix