What Data Quality Demands of Us
Data doesn't need to be perfect. But its imperfections must be understood. A closing argument for building a practical, urgent field of data quality research.
Data does not need to be perfect to be useful. But its imperfections must be understood if the data is to be used effectively. The US’s COVID-19 data gave us an opportunity to explore just how messy and muddled modern data actually is. We dove into the details of why this data was prone to failure, but we didn’t discover any thematic problem that was particular to public health data. The variety, velocity, and volume of the data drove its quality issues and constrained our ability to assess them. These “Three V’s” typify modern data. Whether it’s the variety of simultaneously supported app versions, the minute-by-minute velocity of streaming exit poll data, or the massive volume of consumer data from international e-commerce giants, the same underlying issues pervade the data that drives industry, government, and our private lives.
The need for research into data quality is pressing. We have developed complicated and impressive ways to use data, through machine learning, neural networks, generative AI, and other models. We have not developed methods to ensure these models are trained on data of sufficient quality to support them. Companies, governments, and nonprofits increasingly base their decisions and direct capital according to data. Yet we have not developed standards by which to evaluate that data. Whether it’s credit checks, admissions decisions, mortgage approvals, or employee background checks, our personal lives are increasingly determined by data. Unfortunately, whether the data is actually up to the task largely goes unquestioned.
The advantage of using high-quality data is undeniable. It is the difference between fact and rumor, between witness and gossip. Companies that pursue data quality will undoubtedly gain an advantage. Governments that ensure data quality can respond to reality instead of misrepresentation. Individuals could be treated as they actually are, and not as a reflection of the resources available to collect data on them.
There is so much urgent, necessary work to be done in the field of data quality. We need real examples with real constraints, specific metrics, and actual datasets. We must build from the particular to grow the foundation for the general. If we are to make decisions or automate actions, if we are to build our society or expand human knowledge on the foundations of data, we can do so conscionably only in light of our data quality. For that, we need to build a practical field of data quality research.