November 20, 2023
By Ryan Douglas and Anil Tolwani, Ace Data Scientists
To accomplish our mission of advancing AI to eliminate trial and error, we are steadfastly working to construct the world's most comprehensive dataset of human health. This data fuels the AI models that generate digital twins, driving significant progress in neurodegenerative diseases like Alzheimer's, ALS, and FTD, as well as inflammation and immunology diseases like Crohn's.
In the first part of this blog series, we discussed the challenge of combining different data sources and emphasized the need for custom extract, transform, and load (ETL) pipelines that can handle different formats and structures.
This has worked well so far: our system now holds over 170,000 patient records, each filled with detailed longitudinal health measurements. Our approach values flexibility, but more customization means more chances for error. And this is just the start; we have a backlog of over 500,000 more records to go through. As this number grows, the challenge is to keep the system flexible yet robust enough to scale 10-100x while maintaining rigorous quality standards. Ultimately, we need to drive the cost of adding new data to zero.
So how does this happen?
Clinical data is inherently complex, varying enormously in both size and meaning. Traditional data harmonization techniques tend not to work because coding systems vary from clinician to clinician, and unlike EHR or -omics data, these records follow no common standard.
Take, for instance, the cognitive assessments vital for neurodegenerative disease studies. We often encounter data in the form of scanned PDFs of questionnaires with clinician notes scrawled in the margins. These notes are critical as they contain insights that standardized forms may not capture. In the past, this would require a manual review by a team of clinical and data scientists, which is time-consuming and not scalable.
Enter: LLMs
Large language models (LLMs) have been at the forefront of machine learning research in 2023, and they have emerged as powerful tools for reasoning, summarizing, and comparing disparate textual information. In the context of clinical data, LLMs enable us to overcome the challenges of data heterogeneity and lack of standardization. By leveraging LLMs, we can programmatically interpret and apply rules and standards for mapping clinician notes and custom language in order to create a ground truth “blueprint” of terms.
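As a rough illustration of this idea, the sketch below shows how an LLM call could be used to map a raw, clinician-specific label onto a standardized term from such a blueprint. The call_llm helper, the candidate terms, and the prompt wording are hypothetical stand-ins, not a description of our production pipeline.

```python
# Illustrative sketch only: `call_llm` is a hypothetical stand-in for
# whichever LLM API is used; the candidate terms below are made up.
import json

def call_llm(prompt: str) -> str:
    """Send a prompt to an LLM and return its text response (stub)."""
    raise NotImplementedError("Wire this up to your LLM provider of choice.")

BLUEPRINT_TERMS = [
    "MDS-UPDRS Part III total score",
    "ALSFRS-R total score",
    "MMSE total score",
]

def map_raw_label(raw_label: str, clinician_note: str) -> dict:
    """Ask the LLM to map a raw dataset label onto a blueprint term."""
    prompt = (
        "You harmonize clinical trial variables.\n"
        f"Raw label: {raw_label!r}\n"
        f"Clinician note: {clinician_note!r}\n"
        f"Candidate standard terms: {BLUEPRINT_TERMS}\n"
        "Reply as JSON with keys 'term' and 'confidence' (0-1)."
    )
    return json.loads(call_llm(prompt))

# Example (once call_llm is implemented):
# map_raw_label("updrs3_tot", "motor exam done off medication")
```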
Start here: Building a ground truth
To be able to perform data harmonization using LLMs, we need a “ground truth” set of variables and textual representation to compare against. This information serves as a benchmark, ensuring that the data harmonization process aligns with variables of clinical relevance. It is the blueprint from which our LLMs operate, enabling them to apply the necessary rules and standards to harmonize diverse datasets.
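For concreteness, one way to represent such a blueprint is as a small, typed registry of clinically relevant variables and their textual descriptions. The field names and example entry below are illustrative assumptions, not our actual schema.

```python
# Illustrative blueprint entry; field names are assumptions, not our schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class BlueprintVariable:
    name: str               # canonical variable name used downstream
    description: str        # textual definition the LLM compares against
    units: str | None = None
    allowed_range: tuple[float, float] | None = None
    synonyms: tuple[str, ...] = ()

GROUND_TRUTH = {
    "mmse_total": BlueprintVariable(
        name="mmse_total",
        description="Mini-Mental State Examination total score (0-30); "
                    "higher scores indicate better cognition.",
        units="points",
        allowed_range=(0, 30),
        synonyms=("MMSE", "mini mental"),
    ),
}
```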
Challenge 1: Embeddings quality
The quality of data embedding is a pivotal factor in our success. In the realm of clinical datasets, noise is not just background chatter; it can contain critical signals that differentiate between symptoms and side effects.
For example, common motor assessment surveys contain sets of nearly identical questions, and substantial contextual information is required to recognize that each question assesses a slightly different aspect of a given disease and should be treated as such. However, LLM embeddings are limited by the context available in a single input, and many of these phrases use nearly identical language. Thus, we need alternative ways of encoding domain-specific information beyond the textual description itself, and we need embeddings that can discern nuanced semantic differences.
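One simple way to encode that extra domain information is to prepend structured metadata to each question before embedding it. The sketch below uses sentence-transformers as one possible embedding model; the model choice and the metadata fields are illustrative assumptions, not our actual setup.

```python
# Sketch: enrich near-identical question text with structured context before
# embedding. The model choice and metadata fields are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

def contextualize(question: str, survey: str, domain: str, item_id: str) -> str:
    """Prepend domain-specific metadata so similar wording stays distinguishable."""
    return f"survey: {survey} | domain: {domain} | item: {item_id} | text: {question}"

questions = [
    contextualize("Does the patient have difficulty walking?",
                  survey="Motor Assessment", domain="gait", item_id="Q7"),
    contextualize("Does the patient have difficulty walking?",
                  survey="Motor Assessment", domain="balance", item_id="Q12"),
]

embeddings = model.encode(questions, normalize_embeddings=True)
```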
Denoising
Similarly, low-information descriptions can confound embeddings. Consider this representative diagram:
When embeddings are used to measure semantic similarity, vector similarity can be misleading. Because differences between examples can be subtle and vary widely in "information quality", "noisy" embeddings can crowd the space, showing high topic similarity despite low content quality. Removing them can improve performance.
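One simple way to approximate this denoising step is to drop low-information descriptions before ranking candidates by cosine similarity. The length-based filter below is a deliberately crude, hypothetical heuristic standing in for whatever quality filter is actually used.

```python
# Crude denoising sketch: drop low-information descriptions before ranking by
# cosine similarity. The length-based filter is an illustrative heuristic only.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(query_emb: np.ndarray, candidates, min_tokens: int = 5):
    """candidates: list of (description, embedding) pairs."""
    # Filter out descriptions too short to carry real signal, then rank the rest.
    kept = [(desc, emb) for desc, emb in candidates if len(desc.split()) >= min_tokens]
    return sorted(
        ((cosine_similarity(query_emb, emb), desc) for desc, emb in kept),
        reverse=True,
    )
```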
Challenge 2: Building really good internal tools
The second challenge in building a great internal application isn’t what’s being built; it’s how you build it.
There is a balance to be struck between building useful tools that make assumptions and providing enough flexibility not to pigeon-hole users into a narrow set of usage patterns.
In this context, it means providing the user with all relevant information to approve or decline a suggestion computed by the model. This could include our confidence in the suggestion, the underlying data itself, and the quality of our “ground truth” description or the data specification. It also means providing different pathways for utilizing the tool at different speeds and for users with different levels of technical ability.
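Concretely, a review-oriented tool can bundle everything a reviewer needs into a single record. The fields below are an illustrative assumption about what such a payload might contain, not a description of our internal application.

```python
# Illustrative payload for human review of a model suggestion; fields are assumptions.
from dataclasses import dataclass

@dataclass
class MappingSuggestion:
    raw_label: str              # label as it appears in the source dataset
    suggested_variable: str     # blueprint variable the model proposes
    confidence: float           # model's confidence in the suggestion (0-1)
    source_excerpt: str         # underlying data shown to the reviewer
    blueprint_description: str  # ground-truth description being matched against
    status: str = "pending"     # "pending" | "approved" | "declined"

def review(suggestion: MappingSuggestion, approve: bool) -> MappingSuggestion:
    suggestion.status = "approved" if approve else "declined"
    return suggestion
```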
Where this gets us
Marrying the flexibility of highly modular, configurable pipelines with LLMs that can synthesize and interpret clinical context for consistent, rigorous automation lets us create a system capable of automating nearly all of our data processing. This reduces the risk of error, ensuring consistent and reliable data quality. The best is still to come: as we improve the domain-specific parts of this process, we can incorporate more clinical context and process a wider variety of disease indications, trial designs, and volumes of data.
The Future
To date this year, we have released Digital Twin Generators (DTGs) in seven different disease areas, each with the potential to lower variance, increase statistical power, and decrease sample sizes in today's clinical trials. But this is just the start, and the data products we build today will fuel the models of tomorrow.
To learn more about how we use data, stream our recent Endpoints webinar led by Unlearn’s VP of Tech, Alex Lang, called “Data: The unsung hero of AI.”