MedEmbed: Fine-Tuned Embedding Models for Medical / Clinical IR

Community Article Published October 20, 2024

[Figure: benchmark scores]

Note: This article is also available on Medium. Feel free to follow me there.

Introduction

In the era of information explosion, the medical field faces a unique challenge: how to efficiently retrieve and utilize the vast amounts of clinical data, research findings, and medical literature available. Traditional information retrieval systems often fall short when dealing with the complexity and specificity of medical terminology and concepts. This is where MedEmbed steps in, offering a solution to enhance medical information retrieval and natural language processing (NLP) tasks in healthcare.

MedEmbed is not just another embedding model; it's a family of specialized embedding models meticulously fine-tuned for medical and clinical data. By pairing contrastive fine-tuning with a synthetic data generation pipeline, MedEmbed achieves strong performance on medical retrieval tasks, outperforming models many times its size.

Model Download Links: v0.1

Dataset Download Links: v1

The Challenge of Medical Information Retrieval

Complexity of Medical Data

Medical information is inherently complex, characterized by:

  1. Specialized Terminology: Medical jargon and technical terms that are rarely used in general language.
  2. Contextual Nuances: The same term can have different implications based on the medical context.
  3. Evolving Knowledge: Rapid advancements in medical research require constant updates to information systems.
  4. Interdisciplinary Nature: Medical knowledge often spans multiple domains, from biology to pharmacology to patient care.

Limitations of General-Purpose Models

While general-purpose embedding models have made significant strides in natural language understanding, they often struggle with medical data:

  1. Lack of Domain Knowledge: General models may not capture the deep semantics of medical terms.
  2. Misinterpretation of Context: Medical context can be lost or misunderstood by models trained on general text.
  3. Insufficient Specificity: General models may not distinguish between closely related medical concepts.
  4. Inability to Handle Rare Terms: Many crucial medical terms appear infrequently in general text corpora.

These limitations can lead to suboptimal performance in critical healthcare applications, potentially affecting patient care and medical research.

MedEmbed: A Tailored Approach to Medical Embeddings

The MedEmbed Family

MedEmbed addresses these challenges with a suite of models designed specifically for medical data:

  1. MedEmbed-Small-v1: A compact model ideal for resource-constrained environments or edge devices in healthcare settings. (MVP)
  2. MedEmbed-Base-v1: A balanced model offering strong performance for a wide range of medical NLP tasks.
  3. MedEmbed-Large-v1: The largest model in the family, delivering the strongest performance on demanding medical information retrieval tasks.

Each model in the MedEmbed family is carefully crafted to capture the intricacies of medical language while maintaining efficiency and scalability.
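
Since the models are released as standard embedding checkpoints, a sentence-transformers workflow is the natural way to try them. Below is a minimal usage sketch; the model id `abhinand/MedEmbed-small-v0.1` is an assumption based on the release naming above, so substitute the checkpoint you actually download.

```python
# Minimal retrieval sketch with sentence-transformers.
# The model id below is an assumed placeholder; use the checkpoint you downloaded.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("abhinand/MedEmbed-small-v0.1")

query = "first-line treatment for community-acquired pneumonia in adults"
documents = [
    "Amoxicillin is commonly recommended as initial therapy for community-acquired pneumonia.",
    "Metformin is the usual first-line agent for type 2 diabetes mellitus.",
]

# Encode the query and documents, then rank documents by cosine similarity.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(documents, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)[0]

for doc, score in sorted(zip(documents, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```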

The Development Process: From Clinical Notes to State-of-the-Art Embeddings

The creation of MedEmbed involved a sophisticated and innovative process, combining the power of large language models with real-world clinical data.

Data Collection and Preparation

  1. Source Data: The foundation of MedEmbed is built on a vast collection of clinical notes and medical literature from PubMed Central (PMC). This ensures that the models are grounded in real-world medical language and concepts.

  2. Data Cleaning and Preprocessing: Rigorous cleaning and anonymization processes were applied to ensure data quality and protect patient privacy.

Synthetic Data Generation Pipeline

[Figure: synthetic data generation flow]

The heart of MedEmbed's success lies in its unique synthetic data generation process:

  1. LLM-Powered Generation: The cleaned clinical notes are processed through LLaMA 3.1 70B, a state-of-the-art large language model. This step generates high-quality query-response pairs that capture the complexity of medical queries and their corresponding relevant information.

  2. Query Diversity: The pipeline generates various types of queries for each clinical note:

    • Keyword-based queries
    • Natural language questions (Q/A format)
    • Queries related to treatments, procedures, and follow-ups
  3. Negative Sampling: To enhance the model's discriminative abilities, challenging negative examples are created. These are designed to be semantically close to the positive examples, forcing the model to learn fine-grained distinctions in medical contexts.

  4. Triplet Formation: The positive and negative examples are combined to form triplets (query, positive response, negative response). This format is crucial for the contrastive learning approach used in training.
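
For illustration, here is a minimal sketch of how such (query, positive, negative) triplets can drive contrastive fine-tuning with the sentence-transformers training API. The base checkpoint, example data, and hyperparameters are placeholders, not the exact MedEmbed recipe.

```python
# Illustrative contrastive fine-tuning on (query, positive, negative) triplets.
# Base model, data, and hyperparameters are assumptions for the sketch.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Each triplet pairs a medical query with a relevant passage and a hard negative.
triplets = [
    (
        "post-operative care after total knee replacement",           # query
        "Patients should begin physical therapy within 24 hours ...", # positive
        "Total hip replacement recovery typically involves ...",      # hard negative
    ),
    # ... thousands more triplets produced by the synthetic pipeline
]

train_examples = [InputExample(texts=[q, pos, neg]) for q, pos, neg in triplets]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# A general-purpose embedding model used as the starting point (placeholder choice).
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="medembed-finetuned",
)
```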

Benchmark Performance

MedEmbed's performance has been rigorously evaluated on a suite of medical NLP benchmarks, demonstrating its superiority over existing models.

Evaluation Benchmarks

The models were tested on five key medical retrieval benchmarks:

  1. ArguAna
  2. MedicalQARetrieval
  3. NFCorpus
  4. PublicHealthQA
  5. TRECCOVID
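
All five tasks are available in the MTEB benchmark suite, so a comparable evaluation can be reproduced with a few lines of code. The sketch below assumes the `mteb` and `sentence-transformers` packages; the model id is a placeholder.

```python
# Rough sketch of running the five retrieval benchmarks via the MTEB library.
# Assumes `pip install mteb sentence-transformers`; the model id is a placeholder.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("abhinand/MedEmbed-small-v0.1")  # placeholder checkpoint

tasks = ["ArguAna", "MedicalQARetrieval", "NFCorpus", "PublicHealthQA", "TRECCOVID"]
evaluation = MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/medembed-small")
```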

Key Performance Metrics

The evaluation focused on several critical metrics:

  • nDCG (Normalized Discounted Cumulative Gain) at 1, 5, and 10
  • MAP (Mean Average Precision) at 5 and 10
  • Recall at 1, 5, and 10
  • Precision at 1 and 5
  • MRR (Mean Reciprocal Rank) at 1 and 5
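
To make the headline metric concrete, here is a simplified computation of nDCG@k for a single query from graded relevance labels; benchmark harnesses report the same quantity averaged over all queries.

```python
# Simplified nDCG@k for a single query, using graded relevance labels.
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # positions 1..k -> log2(2..k+1)
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the top-5 retrieved documents for one query (higher = more relevant).
print(ndcg_at_k([3, 2, 0, 1, 0], k=5))
```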

The mix{N} models shown in the visualizations below were model merges created using LM_Cocktail.
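
For context, LM_Cocktail merges the weights of multiple checkpoints into a single model. The sketch below follows the library's `mix_models` interface; the specific checkpoints, weights, and output path are placeholders rather than the actual mix{N} recipes.

```python
# Illustrative weight merge with LM_Cocktail (pip install LM_Cocktail).
# Checkpoints, weights, and output path are placeholders, not the mix{N} recipes.
from LM_Cocktail import mix_models

mix_models(
    model_names_or_paths=[
        "BAAI/bge-base-en-v1.5",
        "abhinand/MedEmbed-base-v0.1",  # assumed fine-tuned checkpoint name
    ],
    model_type="encoder",   # embedding (encoder-only) models
    weights=[0.5, 0.5],     # simple equal-weight merge
    output_path="./mixed-medembed-base",
)
```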

[Figures: per-benchmark results on TRECCOVID, PublicHealthQA, NFCorpus, MedicalQARetrieval, and ArguAna, plus a head-to-head comparison]

Highlights of the Results

  1. Small Model Excellence:

    • MedEmbed-Small-v1 consistently outperformed the BAAI/bge-small-en-v1.5 model across all benchmarks.
    • Notable improvements were seen in nDCG@10 and MAP@5 metrics, with increases of >10% on some tasks.
  2. Base Model Achievements:

    • MedEmbed-Base-v0 showed significant enhancements over the BAAI/bge-base-en-v1.5 model.
    • Particularly strong performance in the MedicalQARetrieval and PublicHealthQA benchmarks, with improvements of over 10% in Recall@5 and MAP@10.
  3. Large Model Superiority:

    • MedEmbed-Large-v0 demonstrated superior performance compared to the BAAI/bge-large-en-v1.5 model.
    • Achieved its strongest results on the TRECCOVID benchmark, with a >10% improvement in nDCG@5 and a 15% boost in MAP@10.
  4. Cross-Size Comparisons:

    • In a remarkable display of efficiency, MedEmbed-Small-v1 outperformed the larger BAAI/bge-base-en-v1.5 model on several metrics across multiple benchmarks.
    • MedEmbed-Base-v0 showed competitive performance against larger models, often matching or exceeding the performance of BAAI/bge-large-en-v1.5.

These results underscore the effectiveness of MedEmbed's specialized training approach and its ability to capture medical domain knowledge efficiently.

Potential Real-World Applications and Impact

The potential applications of tailored embedding models in healthcare and medical research are vast and transformative:

  1. Enhanced Clinical Decision Support
  2. Accelerated Medical Research
  3. Improved Patient Care
  4. Optimized Electronic Health Record (EHR) Systems
  5. Advanced Medical Education
  6. Public Health and Epidemiology
  7. Pharmaceutical Research and Development

Future Directions and Ongoing Work

While MedEmbed has already demonstrated impressive capabilities, the journey is far from over. The team behind MedEmbed is committed to pushing the boundaries of medical AI even further:

  • Better model variants
  • Advanced retrieval techniques, such as late interaction with ColBERT (see the sketch below)
  • Improvements to the synthetic data generation pipeline
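
To give a sense of what late interaction means in practice, the sketch below illustrates ColBERT-style MaxSim scoring with plain PyTorch tensors: each query token embedding is matched to its most similar document token embedding, and the maxima are summed. This is a conceptual illustration only, not MedEmbed code.

```python
# Conceptual ColBERT-style late-interaction (MaxSim) scoring with PyTorch.
# The token embeddings are stand-ins; a real system would use a ColBERT encoder.
import torch
import torch.nn.functional as F

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> float:
    """Sum over query tokens of the max cosine similarity with any document token."""
    q = F.normalize(query_tokens, dim=-1)   # [num_query_tokens, dim]
    d = F.normalize(doc_tokens, dim=-1)     # [num_doc_tokens, dim]
    sim = q @ d.T                           # pairwise cosine similarities
    return sim.max(dim=1).values.sum().item()

# Toy example: 4 query tokens and 12 document tokens in a 128-dimensional space.
query_tokens = torch.randn(4, 128)
doc_tokens = torch.randn(12, 128)
print(maxsim_score(query_tokens, doc_tokens))
```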

Get Involved: Hosting and Experimentation

For researchers, developers, and healthcare professionals interested in exploring the capabilities of large language models in the medical domain, we've created a convenient solution:

RunPod Template for LLaMA 3.1 70B

To facilitate experimentation and further development, we've set up a RunPod template that allows easy deployment of LLaMA 3.1 70B, the backbone of our synthetic data generation pipeline:

This template provides a hassle-free way to self-host large language models, allowing you to explore these powerful tools in your own research or applications.
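
If the deployed template exposes an OpenAI-compatible endpoint, as vLLM- and TGI-based setups commonly do, generating synthetic queries for a clinical note can look roughly like the sketch below. The endpoint URL, API key, and model name are placeholders; check the template's documentation for the actual values.

```python
# Hedged sketch of calling a self-hosted LLaMA 3.1 70B endpoint to draft synthetic
# queries for a clinical note. The URL, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-runpod-endpoint>/v1",  # placeholder endpoint
    api_key="not-needed-for-local-deployments",    # placeholder key
)

clinical_note = "Patient presents with progressive dyspnea on exertion ..."

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model name
    messages=[
        {"role": "system", "content": "You generate retrieval queries for clinical notes."},
        {"role": "user", "content": f"Write 3 search queries answerable by this note:\n{clinical_note}"},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
```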

Step-by-Step Guide

For a detailed walkthrough on setting up and using this template, we've prepared a comprehensive guide:

This guide covers everything from initial setup to advanced usage.

Conclusion

The journey of MedEmbed is just beginning, and we invite the medical and AI communities to join us in exploring its capabilities, contributing to its development, and helping to realize its full potential in improving human health worldwide.


For more information, collaboration opportunities, or to access the MedEmbed models, please visit our GitHub repository or contact Abhinand Balachandran at abhinand.ml@gmail.com.

Join us in revolutionizing medical information retrieval and paving the way for a more informed, efficient, and effective healthcare future with MedEmbed.