The Data Revolution's Blind Spot

We have built sophisticated ways to use data — machine learning, generative AI, real-time models. We have not built the methods to know whether that data is worth using.


We have undergone a data revolution. In the past twenty years, we have seen massive changes in the data that is available, who uses it, and how it is used. In 2006, Clive Humby very quotably declared that “data is the new oil” (Arthur, 2013). The next year, Apple released the first iPhone, ushering in the era of smartphones and the constant streams of data that they would create. In 2012, Harvard Business Review deemed “data scientist” the sexiest job of the 21st century (Davenport and Patil, 2012). Python, with its pseudocode-like syntax, exploded in popularity. Open-source tools like Jupyter Notebook, scikit-learn, and TensorFlow made machine learning accessible to any industry or academic field ready to embrace it. Last winter, OpenAI publicly released ChatGPT, a conversational generative AI trained on vast amounts of data available on the internet, a practice whose legality is now being challenged by copyright holders. It and other large language models have already begun to flood the internet with their generations and hallucinations.
The explosion of e-commerce, app-first companies, and the internet of things has created vast amounts of data for data scientists to explore. Extremely lax regulations around data collection have enabled companies to create constant streams of data tracking every click and swipe of internet users. Data brokers and corporations across industries have begun to sell, buy, and combine datasets to create digital twins of individual American consumers. Publicly available data scraped from the internet, point-of-sale data created at checkout, log data that tracks software function, personal health data, financial data, government-released data, and all other sorts of data are mixed, molded, and munged to create modern datasets. Data is no longer largely drawn from well-defined experiments. Data is now found, repurposed, reused, and continuously recycled.
Whether we call it “big” or not, modern data is well described by the “three V’s of Big Data”: volume, variety, and velocity. Those three V’s make it hard to understand whether or not the data created is worth creating. The size of modern datasets makes quality difficult to assess and anomalies hard to detect. The variety of sources that generate data disguises the origins of issues and underlies correlations of missingness and errors with subjects. The velocity with which data is produced and the speed with which it is used create an imposingly short timeline for data to be assessed, interpreted, and intervened upon.
Assessing the quality of modern data is hard. But unless we assess it, we have no idea if the data we have is worth using. In practice, data quality is often not assessed; it is assumed. The data arrives in neatly organized tables at the end of a SQL query or API call. It appears credible, plausible, and intimidating. It is, after all, created by a machine. And just like the output of generative AI chatbots, it looks convincing enough to believe. Too often it is trusted instead of tested.
For these and other reasons, data quality as a field of research has not been given the attention it deserves. It is easier to assume than to affirm; assuming is also more popular and better rewarded. Yet when the problem we are solving is important, when it is a matter of life and death, when it is a matter of millions of lives and deaths, it is suddenly much clearer how critical data quality is. The COVID-19 pandemic gave us a rare public example of modern data and the inherent challenges to its quality. We have this dataset thanks in particular to the fine folks at the Johns Hopkins University Center for Systems Science and Engineering and, more broadly, to public health professionals, healthcare workers, journalists, researchers, and so many others.
We explore data quality in COVID-19 data as an example that we hope will be applied broadly to modern data across applications and research. We also hope that exploring this data will spark fervor for data quality within and beyond industries with clear financial incentives. We hope that this will spur investment into the infrastructure that will create higher-quality data for public health and other public services.