Process mining allows manufacturers to collect enormous amounts of industrial data and to understand their internal processes and operations in greater detail. As they discover and analyze new data sets, manufacturers expect to deepen their knowledge, gain meaningful insights and optimize the way they work.
However, as long as data sets remain siloed, they can’t be used to enhance operations. As such, a few years ago, manufacturers started to invest in data warehouses in order to store all data in the same place and organize it into highly structured tables. Unfortunately, new data sets were too large and unstructured to be smoothly integrated into a data warehouse, which therefore could not serve the purpose it had been created for.
To overcome their industrial data storage issues, many manufacturing firms decided to switch to a more flexible and versatile approach to data storage: data lakes.
Data lakes: advantages and challenges
Data lakes offer two main advantages. They can store as much unstructured, structured and historical data as a user needs, regardless of its value or format. And they allow data scientists to apply the most appropriate analytics tools to each data set, in its original location. Because data doesn’t need to be integrated or transformed before it is stored, a data lake is a relatively cheap storage option. As a consequence, data can be stored indefinitely, meaning data scientists can analyze industrial process data in a more comprehensive way and to a deeper level.
Technically speaking, the technology behind data lakes enables manufacturers to capture, store and analyze new, huge and heterogeneous data sets without having to worry about whether, or how, the data is structured.
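In practice, this “store first, structure later” principle often amounts to landing each file unchanged and recording a little metadata next to it. The sketch below illustrates the idea in Python; the folder layout, source name and metadata fields are invented for the example, not a reference to any particular data lake product:

```python
import datetime
import json
import pathlib
import tempfile

def ingest_raw(lake_root: pathlib.Path, source: str, payload: bytes, fmt: str) -> pathlib.Path:
    """Land a raw file in the lake unchanged, plus a small metadata sidecar.

    Schema-on-read: nothing is parsed or transformed at ingestion time.
    """
    day = datetime.date.today().isoformat()
    target_dir = lake_root / "raw" / source / day      # partition by source and date
    target_dir.mkdir(parents=True, exist_ok=True)
    data_file = target_dir / f"part-0000.{fmt}"
    data_file.write_bytes(payload)                     # stored exactly as received
    meta = {"source": source, "format": fmt, "ingested_at": day, "bytes": len(payload)}
    (target_dir / (data_file.name + ".meta.json")).write_text(json.dumps(meta))
    return data_file

# Demo: land one hypothetical machine export in a throwaway lake
lake = pathlib.Path(tempfile.mkdtemp())
landed = ingest_raw(lake, "press_line_3", b"a;b\n1;2\n", "csv")
```

Without at least this much sidecar metadata (or a proper catalog built on top of it), finding and understanding data in the lake later becomes very hard.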
However, as data lakes grow, they raise new issues that prevent manufacturers from reaping all the benefits of data-driven decision-making. Data lake challenges include difficulties in importing new data sources, integrating internal and external data sets, sharing data, and finding and understanding the data already in the lake.
Reading this list, you might notice that most of the issues Chief Data Officers faced before data lakes still exist. Data lakes didn’t solve the problems they were meant to solve. Unsurprisingly, many data lakes have turned into data swamps, filled with stagnating data that is not used and doesn’t add any value. Yet, at the moment, data lakes are the only option for manufacturers to get a comprehensive view of their company’s data.
As there is no alternative, you must be aware of the issues stemming from data lakes so that you can find the best solution based on your organizational needs.
The 5 issues you must be aware of when building your company’s data lake
1) Lack of trust in sharing new data sets
When a new data set is created or acquired, its creators probably have a clear understanding of its value and know whether or not it contains sensitive, confidential information. Those outside the creating team, however, may not share that understanding, and the original creators may not trust them with the data.
Indeed, a common issue with creating and sharing new data sets is the lack of trust throughout the organization. When there is no process to ensure organizational trust and data safety, creators of new data sets tend to store their data where they can control it and will refrain from sharing their insightful data in the company’s data lake.
As a consequence, new data sets end up being used by only a few people in the company, and organizations fail to reach their initial goal of having one unique source of truth that concentrates all organizational data.
2) Data integration: still costly and time-consuming
Let’s say everyone in your organization trusts your processes and agrees to integrate their new data sets into the company’s data lake. While the idea behind data lakes is to import data in its raw format, this is not always possible. In fact, most data sources need some processing before people in the company can use them.
Integrating both new data sources and old legacy systems therefore turns out to be a very time-consuming activity: data integration can take months, even years. In the meantime, data scientists cannot get a comprehensive view of the information trapped in the company’s data and end up not trusting your data lake as the unique source of truth it was meant to be.
Businesses should therefore focus on figuring out innovative approaches to use and integrate new data sources into their data lake. While this might be costly and require adequate internal skills, there are currently no alternatives.
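As a simple illustration of why “raw” rarely means “usable”, consider a legacy machine export with semicolon delimiters and comma decimals. The format and field names below are invented for the example; the point is that even trivial sources need a parsing step before anyone can analyze them:

```python
import csv
import io

# Hypothetical raw export from a legacy machine controller:
# semicolon-delimited, comma as decimal separator.
RAW = "ts;temp_c;spindle_rpm\n2024-05-01T08:00;71,5;1200\n"

def parse_legacy(raw: str) -> list:
    """Turn a legacy export into plain records an analyst can actually use."""
    reader = csv.DictReader(io.StringIO(raw), delimiter=";")
    records = []
    for row in reader:
        records.append({
            "timestamp": row["ts"],
            "temperature_c": float(row["temp_c"].replace(",", ".")),  # "71,5" -> 71.5
            "rpm": int(row["spindle_rpm"]),
        })
    return records
```

Multiply this small adapter by hundreds of sources, each with its own quirks, and months-long integration timelines become easy to believe.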
3) Storing data: what’s your smart data?
Compared to data warehouses, data lakes considerably cut the cost of data storage. Nonetheless, they still require a sizeable investment. As such, another critical challenge stemming from data lakes is that companies often don’t know which data sets are actually meaningful and actionable. The risk is spending a lot of time and resources storing data that provides no real value and just turns your lake into a stagnant data swamp.
4) Inconsistent terminology across the company
Let’s imagine you already know which data sets provide you with the most value and are worth storing. Your stakeholders trust each other and you successfully integrated all that data into your data lake. A new crucial issue arises: will users be able to find the data they need? Often, the answer is no.
A common cause of this issue is inconsistent terminology when storing data. Different departments across the organization frequently use different names for the same thing. This leads to confusion, data misinterpretation and, eventually, to bad decisions based on inaccurate information.
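One lightweight mitigation is a shared glossary that maps each department’s local names onto a single canonical term, applied whenever data is read or catalogued. The sketch below shows the idea; the synonyms are invented examples:

```python
# Hypothetical shared glossary: department-specific column names -> canonical term
GLOSSARY = {
    "cust_id": "customer_id",
    "client_no": "customer_id",
    "temp": "temperature_c",
    "temp_celsius": "temperature_c",
}

def normalize_columns(record: dict) -> dict:
    """Rename known synonyms to their canonical names; leave unknown keys as-is."""
    return {GLOSSARY.get(key, key): value for key, value in record.items()}
```

A glossary like this only helps if it is owned and maintained centrally; otherwise it becomes one more inconsistent artifact in the lake.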
5) Combining internal and external data
While data lakes aim to be the single source of truth of a data-driven organization, it is impossible to include every new data set your data scientists feel could be useful one day. After all, it wouldn’t make much sense to integrate a complete replica of Google Maps into your data lake, even if one of your data scientists wanted to launch a geospatial analysis project.
People in the company will still have to look for information outside of your data lake. In today’s digital economy, organizations generate such a huge amount of data that there will always be new data sets that can help them make better decisions, regardless of where the data originated. This is not an issue per se. However, you should make sure your corporate data culture understands and stresses the importance of data lakes as single sources of truth.
New data sets raise new challenges but offer new opportunities
The amount of data generated in Industry 4.0 is so large that, despite all their flaws, there is no real alternative to data lakes for the manufacturers of the future. The issues discussed in this article are critical, and manufacturers who want to win the digitalization race must be aware of them and ready to respond.
Figuring out how to smoothly integrate new data sets and legacy systems while ensuring data quality, safety and value is probably the biggest source of competitive advantage manufacturers will find for many years to come. Unsurprisingly, it is also the most challenging.