DATABERG
How is Big Data compromising our Data Quality?
The increased volume and variety of data that companies manage nowadays make it more challenging to ensure that our data has the quality needed to deliver reliable insights.
The truth is that, although many companies have embraced policies, procedures, technologies, and systems to protect the quality of their data, the explosion of Big Data means that many of these "strategies" need to be revisited.
Data governance and data quality for Big Data still have a lot of room for improvement in the vast majority of corporations. Achieving optimal data quality in a reasonable time, with a continuously growing volume of data, is a tough challenge.
What's data quality?
Nowadays, data in companies comes from diverse sources, with different characteristics and levels of complexity. At the same time, the number of "consumers" of this data inside the company has grown exponentially, and the same data now needs to serve several different end-use applications, departments, areas, and purposes.
Thus, determining data quality while taking into account the final use of the data is now more complicated.
Data quality can be defined based on different attributes, and depending on whom we ask inside our company, the focus will be on one characteristic or another. For example, Sales and Marketing might consider business value the most relevant criterion for data quality, whereas Data Science teams will emphasize the accuracy of the data.
How to measure data quality?
The very first definitions of Data Quality considered almost exclusively how well the data fit its requirements. But during the 90s, several studies in the field enriched the interpretation of data quality by considering other attributes, such as fitness for use.
We can find several lists of attributes to consider when analyzing the quality of our data. One of the most detailed is the one created by the UK chapter of the Data Management Association (DAMA), which considers six dimensions that are key to managing data quality (a small measurement sketch follows the list):
- Accuracy: We measure the degree to which the data precisely represents the real-life object or event it describes.
- Completeness: We should also measure whether the data contains all the expected values and fulfills the requirements.
- Consistency: We measure whether the data is free from contradictions and is logically coherent in content, format, and time.
- Timeliness: We should analyze whether the data is available when it is needed.
- Uniqueness: With this dimension, we verify that there are no duplicates in our data set.
- Validity: We need to measure whether our data conforms to business rules or standards in format, type, value, and range.
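As an illustration, here is a minimal sketch of how some of these dimensions could be translated into simple checks with pandas. The column names (customer_id, email, signup_date), the email rule, and the freshness cutoff are hypothetical assumptions, not part of the DAMA definitions.

```python
import pandas as pd

# Hypothetical customer extract; column names and values are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x.com", "not-an-email"],
    "signup_date": pd.to_datetime(["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-01"]),
})

# Completeness: share of non-null values per column.
completeness = df.notna().mean()

# Uniqueness: share of rows that are not duplicated on the key column.
uniqueness = 1 - df.duplicated(subset=["customer_id"]).mean()

# Validity: share of emails matching a (simplistic) format rule.
valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
validity = valid_email.mean()

# Timeliness: share of records newer than a hypothetical freshness cutoff.
timeliness = (df["signup_date"] >= pd.Timestamp("2021-02-01")).mean()

print(completeness, uniqueness, validity, timeliness, sep="\n")
```

In practice, each of these metrics would be computed per data set and tracked over time rather than on a single extract.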
But as mentioned before, Data Quality is different in every company, so we should adapt the relevance of each of these dimensions to our business needs and use cases.
Data Quality based on Data Purpose
As a result of this evolution in the way data is leveraged and exploited inside a company, one of the attributes now gaining the most weight in defining Data Quality is fitness for purpose. Our data should be adapted to different use cases, contexts, and needs.
To that end, it is also essential to make sure that the right data is in front of the decision-maker at the right time and in the proper format. If not, the data will be useless, no matter how well it scores on the six dimensions explained above.
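One lightweight way to make "fitness for purpose" concrete is to define different minimum quality thresholds per data consumer. The profiles, dimension names, and threshold values below are purely illustrative assumptions, not a standard.

```python
# Hypothetical per-use-case quality profiles: each consumer weighs
# the dimensions differently and sets its own minimum thresholds.
QUALITY_PROFILES = {
    "marketing_campaign": {"completeness": 0.80, "uniqueness": 0.95, "timeliness": 0.70},
    "data_science_model": {"completeness": 0.95, "validity": 0.98, "accuracy": 0.97},
}

def fit_for_purpose(scores: dict, use_case: str) -> bool:
    """Check whether measured dimension scores meet the thresholds of a use case."""
    thresholds = QUALITY_PROFILES[use_case]
    return all(scores.get(dim, 0.0) >= minimum for dim, minimum in thresholds.items())

# Example: the same data set can be fit for one purpose and not another.
measured = {"completeness": 0.90, "uniqueness": 0.99, "timeliness": 0.85, "validity": 0.92}
print(fit_for_purpose(measured, "marketing_campaign"))   # True
print(fit_for_purpose(measured, "data_science_model"))   # False (completeness and validity below thresholds)
```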
Why is proper Data Quality Management vital?
Data Quality Management consists of the set of processes that a company follows to ensure the quality of its data, from the definition of roles, policies, and responsibilities to the deployment of procedures to protect our data quality throughout the entire data value chain.
An efficient approach to Data Quality Management should include both proactive and reactive processes, that is, rules and standards on how to proceed when data quality problems arise.
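As a rough sketch of what this could look like in practice, the snippet below contrasts a proactive rule (validating records before they enter the pipeline) with a reactive rule (auditing records already stored). The field names and rules are assumptions chosen for illustration.

```python
from datetime import date

# Proactive check: validate a record before it enters the pipeline.
def validate_incoming(record: dict) -> list[str]:
    """Return a list of rule violations for a single incoming record."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    if record.get("order_date") and record["order_date"] > date.today():
        errors.append("order_date in the future")
    return errors

# Reactive check: scan stored records and flag the ones that break the rules.
def audit_existing(records: list[dict]) -> list[dict]:
    """Return the records that violate at least one data quality rule."""
    return [r for r in records if validate_incoming(r)]

incoming = {"customer_id": None, "order_date": date(2030, 1, 1)}
print(validate_incoming(incoming))  # proactive: reject or quarantine before ingestion
```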
It is key to the success of our Data Quality Management that we involve IT and business areas.
Data Quality in the Big Data era
As we explained at the beginning, the Big Data explosion is not helping data quality levels in companies. Even more relevant, poor Data Quality is one of the causes of failure of many Big Data projects. Now that we have more diverse data sources, some of them new or external, and more complex pipelines, the "veracity" and quality of our data are harder to control. Big Data requires superior data management and data quality management systems and processes to be in place. So the promise of advanced analytics and improved insights that Big Data brings to companies has a dark side: this same Big Data seriously threatens data integrity, accuracy, and quality. A report on Data Management published by Experian in 2017 states that C-level executives report that 33% of their organizations' data is inaccurate.
Conclusion
So, due to the growth of Big Data and companies' growing interest in analytics and other data-driven applications (machine learning, artificial intelligence, etc.), concerns around data quality have increased. A solid Data Governance and Data Quality Management program is imperative to bring back confidence in data-centric activities.