Data-Centric AI is Making Waves

AI experts are turning their focus to data, and we think the potential is huge.

The world of artificial intelligence (AI) is rapidly changing, and one of the shifts we’re most excited about is a refocusing on data.

As a bit of background, AI is made up of two major components: the model and the data. The models are essentially the tools or algorithms that enable computers to analyze data. The data is what those models use to make decisions.

Historically, the vast majority of artificial intelligence developers have focused their efforts on perfecting their models. This approach is known as model-centric AI, and it has led to advanced machine learning (ML) systems that are reliable and capable of a great deal. Model-centric methods focus on continuously iterating on and improving the code or algorithm, and they are best suited to problems where data is abundant and the data quality risks are known and accounted for.

In the past few years, however, there has been a shift in perspective among AI experts toward data-centric AI that has piqued our interest. In this emerging approach, researchers change or improve the data they are using as a way to improve the performance of their AI models. Andrew Ng, an AI trailblazer, defines data-centric AI as “the discipline of systematically engineering the data needed to build a successful AI system,” essentially placing greater importance on the quality of the data used throughout every phase of model creation.

This shift began as a reaction to a handful of ongoing issues in AI research, as well as a recognition that many existing AI systems are maturing to the point where further investment in the model is unlikely to yield meaningful improvements. Because of this, there has been a movement to look beyond the code for ways to improve output even further. One of the most pressing issues confronting model-centric AI is the risk of data cascades.

Data cascades are not a new problem, but they are often hidden, misunderstood, and, at times, ignored. They are compounding events, arising from low-quality data, that cause negative downstream effects. Essentially, when data quality isn’t prioritized, AI models accumulate a kind of technical debt that compromises their output, leading to failures and an overall loss of trust in the AI system.
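To make this concrete, here is a minimal, hypothetical sketch (the records, field names, and plausibility bounds are all invented for illustration) of how one bad upstream record can silently skew everything downstream, and how a simple validation gate at ingestion can stop the cascade early:

```python
# Hypothetical illustration of a data cascade: a single upstream unit
# error (a dose entered in grams instead of milligrams) silently skews
# every downstream statistic unless the data is validated first.

records = [
    {"drug": "drug_a", "dose_mg": 500.0},
    {"drug": "drug_a", "dose_mg": 450.0},
    {"drug": "drug_a", "dose_mg": 0.5},  # entered in grams, not milligrams
]

# Model-centric habit: consume the data as-is; the error flows downstream.
naive_mean = sum(r["dose_mg"] for r in records) / len(records)
print(f"naive mean dose: {naive_mean:.1f} mg")  # skewed by the bad record

# Data-centric habit: validate at ingestion so one bad record is
# caught and reviewed instead of compounding inside a trained model.
PLAUSIBLE_RANGE_MG = (1.0, 5000.0)  # invented bounds for this example

def is_plausible(record):
    low, high = PLAUSIBLE_RANGE_MG
    return low <= record["dose_mg"] <= high

clean = [r for r in records if is_plausible(r)]
flagged = [r for r in records if not is_plausible(r)]
print(f"{len(flagged)} record(s) flagged for review before training")
```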

As Google researchers recently found in a survey of AI practitioners, data cascades in high-stakes AI are pervasive, reported by 92% of the practitioners surveyed. Google also points out that these cascades, and the unreliable and inaccurate results they threaten to produce, are largely avoidable. When we take a step back and think about how our drug data and knowledge are used across the larger medical care and treatment industry, any upstream error can have dangerous real-world consequences for patients.

With the shift to data-centric AI, we are seeing a higher priority placed on the quality of the data used in AI systems, and as a result we believe we’ll see more accurate and reliable outputs. Healthcare currently generates the world’s largest volume of data, and that shows no sign of slowing down anytime soon. It is estimated that by 2025, 36% of the world’s generated data will be healthcare data, and every year more than two million scientific articles are published. We are data rich.

As we amass this rich, global collection of research data, we could see greater innovation in drug discovery and healthcare, but only if we can unlock its potential. Yet this cornucopia of data presents its own problems. It is estimated that 7% of systematic reviews are inaccurate within 24 hours of publication, and that 23% will have incorrect conclusions after just two years if they aren’t updated. As data collection and production continue to ramp up, the potential for better health outcomes only gets stronger. Unfortunately, much of the world’s data remains disconnected, disorganized, conflicting, and unstructured.

The shift to data-centric AI is a natural progression that may be key to unlocking the full potential of the systems we’ve worked so hard to create. For a data-centric approach to work, we will need to map and normalize data from experiments and real-world evidence into deeply structured datasets, and we will need to seriously consider the quality of the data being generated.
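As a loose sketch of what that mapping and normalization might look like in practice (the schema, field names, and vocabulary below are invented for illustration, not a description of any particular pipeline):

```python
# Hypothetical sketch: normalizing heterogeneous source records into a
# single structured schema, with unit conversion and vocabulary mapping.
from dataclasses import dataclass

# Invented controlled vocabulary: free-text drug names map to one ID.
DRUG_VOCAB = {"acetylsalicylic acid": "ASPIRIN", "asa": "ASPIRIN", "aspirin": "ASPIRIN"}
UNIT_TO_MG = {"mg": 1.0, "g": 1000.0, "mcg": 0.001}

@dataclass
class StructuredDose:
    drug_id: str
    dose_mg: float
    source: str

def normalize(raw: dict, source: str) -> StructuredDose:
    """Map one messy source record onto the structured schema,
    failing loudly on any term the vocabulary does not cover."""
    drug_id = DRUG_VOCAB[raw["drug"].strip().lower()]  # KeyError = unmapped term
    dose_mg = float(raw["amount"]) * UNIT_TO_MG[raw["unit"].strip().lower()]
    return StructuredDose(drug_id=drug_id, dose_mg=dose_mg, source=source)

# Two differently shaped source records land in one comparable form.
print(normalize({"drug": "ASA", "amount": "0.5", "unit": "g"}, source="trial_123"))
print(normalize({"drug": "Aspirin", "amount": "500", "unit": "mg"}, source="ehr_feed"))
```

Failing loudly on unmapped terms, rather than guessing, is the point: in a data-centric workflow those failures become review queues that improve the vocabulary itself.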

To truly advance the healthcare industry, we need to break down existing silos in pharma teams and fully buy into the importance of data excellence. To ensure the data produced in the drug discovery space and the broader scientific community supports better AI, we need a cross-disciplinary approach that integrates subject matter expertise with engineering and operations. If we can balance these competing priorities and center high-quality data in the process of developing AI systems, we have the potential to substantially speed up the development of safer and more effective medical treatments.

Naturally, this has us excited. We spend our days obsessing over data quality and building well-structured data that reflects, or can be mapped to, the real world. We believe in the value of looking at data quality through a different lens, one that combines deep data science and biomedical expertise with an understanding of the nature of AI.

We’re working diligently to be the place AI researchers turn to for the highest quality data, and we are very excited about a future where data quality is a first-class citizen in the minds of AI researchers, developers, and the institutions powering the next wave of healthcare.