Mendel Team
January 30, 2023

Creating Accurate Regulatory and Reference Data

Within the real-world evidence space, the generally accepted process for creating a regulatory-grade data set is to have two human abstractors work with the same set of documents and bring in a third reviewer to adjudicate the differences. These data sets also serve a second purpose: as a reference standard against which the performance of human abstractors can be measured. Although this remains the industry standard, it is expensive, time-consuming, and difficult to scale.

At Mendel, we extend this framework and layer in AI to build our own reference set, as follows.

We start the same way as the industry standard: two human abstractors process the patient record independently (the average of the two is noted as H1), and a third human (R1) adjudicates the differences. Mendel builds on this standard by adding an abstraction layer that includes AI: we run the patient record through our AI pipeline and have a human audit and correct the AI output (Human+AI). Finally, a second reviewer (R2) adjudicates the differences between the human-only and Human+AI outputs.
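
To make the workflow concrete, here is a minimal sketch of the layered process described above. The function names, variable names, and values are hypothetical illustrations, not Mendel's actual implementation.

    # Minimal sketch of the layered abstraction workflow (hypothetical names and toy values).
    def adjudicate(output_a, output_b, resolve):
        """Keep values the two outputs agree on; route disagreements to a reviewer."""
        merged = {}
        for variable in output_a.keys() | output_b.keys():
            a, b = output_a.get(variable), output_b.get(variable)
            merged[variable] = a if a == b else resolve(variable, a, b)
        return merged

    # Two independent human abstractions of the same patient record.
    human_1 = {"stage": "IIIA", "er_status": "positive", "histology": "ductal"}
    human_2 = {"stage": "IIIB", "er_status": "positive", "histology": "ductal"}

    # R1: a third human adjudicates the disagreements between the two human abstractors.
    r1 = adjudicate(human_1, human_2, resolve=lambda variable, a, b: a)  # reviewer's call

    # Human+AI: the AI pipeline's output after a human audits and corrects it.
    human_ai = {"stage": "IIIA", "er_status": "positive", "histology": "lobular"}

    # R2: a second reviewer adjudicates the human-only output against the Human+AI output.
    r2 = adjudicate(r1, human_ai, resolve=lambda variable, a, b: b)      # reviewer's call
    print(r2)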

Our goals are to ensure quality both internally and for our customers. We use the gold set to measure the performance of our AI models across the test cohort, generate a quality report, and conduct multiple types of validation to ensure that the data is clinically useful and has been processed correctly, and that the AI models are not skewed.

During this process, we hypothesized that results from Human+AI collaboration would rival the results generated using the previously described regulatory reference model: two human abstractors adjudicated by a third.

The Evaluation: Does combining human and AI efforts lead to high data quality?

At the end of 2022, we conducted a series of evaluations across therapeutic areas to assess how our models perform. We wanted to explore whether combining human and AI efforts leads to higher data quality than the regulatory standard, and by how much.

In this experiment, we looked at a total of 140 patients across three therapeutic areas with the following sample sizes:

  • Breast - 40 patients
  • NSCLC - 40 patients
  • Colon - 60 patients

We calculated an F1 score to compare the performance of the average of the two human abstractors (H1), the two human abstractors with adjudication (R1), and the combination of one human and AI (Human+AI).

The F1 score combines the precision and recall of a classifier into a single metric by taking their harmonic mean. We then compared the F1 scores for variables across therapeutic areas. 
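
As a concrete illustration, here is a small sketch of how an F1 score can be computed for a single variable. The counts are made-up examples, not figures from this evaluation.

    # F1 for one variable, scored against the gold set (illustrative counts only).
    def f1_score(true_positives, false_positives, false_negatives):
        """F1 is the harmonic mean of precision and recall."""
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        return 2 * precision * recall / (precision + recall)

    print(round(f1_score(true_positives=45, false_positives=3, false_negatives=5), 3))  # 0.918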

Understanding the variance across variables

All approaches, whether human-only, adjudicated, or Human+AI abstraction, demonstrate variability in quality across data variable types. When we think about F1 performance, it helps to divide a patient’s data variables into four groups:

  1. Variables that are highly complex for humans, but easier for AI
    Ex. The variable is difficult to find due to the length of the record
  2. Variables that are highly complex for both humans and AI
    These variables can be difficult to extract because they are subjective
  3. Variables that are easy for both humans and AI to extract
    These variables have a clear interpretation
  4. Variables that are easier for humans, but difficult for AI
    These variables may require leaps in reasoning

There are also compound data variables, which depend on multiple correct predictions and are difficult for both humans and AI.
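
One illustrative way to operationalize this grouping, under the assumption that difficulty can be proxied by a per-variable F1 score for a human-only pass and an AI pass, is sketched below. The 0.8 threshold and the example scores are invented for demonstration.

    # Bucket variables into the four groups above using hypothetical F1 scores.
    def group(human_f1, ai_f1, threshold=0.8):
        hard_for_human = human_f1 < threshold
        hard_for_ai = ai_f1 < threshold
        if hard_for_human and not hard_for_ai:
            return "1: complex for humans, easier for AI"
        if hard_for_human and hard_for_ai:
            return "2: complex for both humans and AI"
        if not hard_for_human and not hard_for_ai:
            return "3: easy for both humans and AI"
        return "4: easier for humans, difficult for AI"

    example_scores = {"histology": (0.95, 0.93), "stage": (0.75, 0.90), "reason_for_progression": (0.70, 0.65)}
    for variable, (human_f1, ai_f1) in example_scores.items():
        print(variable, "->", group(human_f1, ai_f1))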

Let’s look at the variables specific to Colon Cancer. 

Below, we compare the F1 scores of the Human+AI approach for colon cancer variables with the F1 scores of the human-only approach. The Human+AI F1 score is shown as a bar graph, and the human-only F1 score is plotted over it.

The Human+AI approach exceeds the quality of the human-only approach for every variable we studied. This is not surprising, since leveraging the AI output gives the human abstractor a significant advantage.

Does this hold up when looking at patient data that has been double-abstracted and adjudicated?

Human+AI performs better than a double-abstracted and adjudicated data set

In the chart below we compare the average of the pooled results across breast, lung, and colon cancers against the gold standard reference set. 

The Human+AI approach has an F1 score of 92.2%, and the double-abstracted and adjudicated (R1) set has an F1 score of 87.8%. Both approaches perform acceptably well against the gold set. However, the Human+AI approach’s F1 score shows an increase of 4.77%.
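
For transparency on the arithmetic, the snippet below reproduces the reported gap, assuming the 4.77% figure expresses the 4.4-point difference relative to the Human+AI score; that basis is our reading, not a stated formula.

    # Reproduce the reported gap between the two F1 scores (basis assumed as noted above).
    f1_human_ai = 0.922
    f1_r1 = 0.878
    gap = f1_human_ai - f1_r1                      # 0.044 -> 4.4 percentage points
    print(round(100 * gap / f1_human_ai, 2))       # 4.77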

In addition, the Human+AI approach requires about ⅓ of the effort and cost of using three humans, making it inherently more scalable. In our next post, we will explore time savings.

Interested in learning more about this evaluation and Mendel’s process? Contact hello@mendel.ai.

Mendel is an end-to-end solution that uses the power of a machine and the nuanced understanding of a clinician to structure unstructured patient data at scale.
