Multimodal Chest X-ray Retrieval & Diagnosis (ClinicalBERT + Swin)

This model jointly encodes chest X-rays (DICOM) and radiology reports (XML) to:

  • Predict medical conditions from multimodal input (image + text)
  • Retrieve similar cases using shared disease-aware embeddings
  • Provide visual explanations using attention and Integrated Gradients (IG)

Developed as a final project at HCMUS.


Model Architecture

  • Image Encoder: Swin Transformer (pretrained, fine-tuned)
  • Text Encoder: ClinicalBERT
  • Fusion Module: Cross-modal attention with optional hybrid FFN layers
  • Losses: BCE + Focal Loss for multi-label classification

Embeddings from both modalities are projected into a shared joint space, enabling retrieval and explanation.
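The BCE + Focal combination down-weights well-classified examples, which matters for the long-tailed Open-i label distribution. A minimal numpy sketch of a focal-modulated binary cross-entropy term (the gamma and alpha values below are common defaults, not necessarily the project's actual settings):

```python
import numpy as np

def focal_bce(probs, targets, gamma=2.0, alpha=0.25, eps=1e-8):
    """Focal-modulated BCE per label for multi-label classification.

    probs, targets: arrays of the same shape with probabilities in [0, 1]
    and binary ground-truth labels.
    """
    # p_t is the predicted probability assigned to the true class
    p_t = np.where(targets == 1, probs, 1 - probs)
    alpha_t = np.where(targets == 1, alpha, 1 - alpha)
    # (1 - p_t)^gamma shrinks the loss on easy, confident predictions
    return -(alpha_t * (1 - p_t) ** gamma * np.log(p_t + eps))
```

An easy example (p = 0.9 for a positive label) contributes far less loss than a hard one (p = 0.1), which is the intended rebalancing effect.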


Training Data

  • Dataset: NIH Open-i Chest X-ray Dataset
  • Input Modalities:
    • Chest X-ray DICOMs
    • Associated XML radiology reports
  • Labels: MeSH-derived disease categories (multi-label)
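Open-i distributes its radiology reports as XML with MeSH terms and structured abstract sections. A stdlib sketch of extracting multi-label targets and report text; the sample document below mirrors the Open-i layout but is illustrative, not a real report:

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for an Open-i report file (structure assumed).
SAMPLE = """<eCitation>
  <MeSH>
    <major>Cardiomegaly/mild</major>
    <major>Pulmonary Congestion</major>
  </MeSH>
  <Abstract>
    <AbstractText Label="FINDINGS">Heart size is mildly enlarged.</AbstractText>
    <AbstractText Label="IMPRESSION">Mild cardiomegaly.</AbstractText>
  </Abstract>
</eCitation>"""

def parse_report(xml_str):
    """Return (MeSH major labels, {section name: section text})."""
    root = ET.fromstring(xml_str)
    labels = [m.text for m in root.iter("major")]
    sections = {a.get("Label"): (a.text or "") for a in root.iter("AbstractText")}
    return labels, sections
```

The MeSH `major` terms become the multi-label targets, while the FINDINGS/IMPRESSION text feeds the ClinicalBERT encoder.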

Intended Uses

  • Clinical Education: Case similarity search for radiology students

  • Research: Baseline for multimodal medical retrieval

  • Explainability: Visualize disease evidence in both image and text

Model Performance

Classification

The model was evaluated on a held-out evaluation set and a separate test set across 22 disease labels. Performance is reported as macro-averaged Precision, Recall, F1-score, and AUROC.

Metric      Eval Set (Macro Avg)   Test Set (Macro Avg)
Precision   0.826                  0.825
Recall      0.829                  0.812
F1-score    0.825                  0.800
AUROC       0.924                  0.943

The model achieves strong label-level performance, particularly on common findings such as COPD, cardiomegaly, and musculoskeletal degenerative disease. Rare conditions such as air leak syndromes show lower F1 scores, reflecting class imbalance in the dataset.
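Macro averaging computes each metric per label and then averages across the 22 labels, so rare conditions weigh as much as common ones. A minimal numpy sketch:

```python
import numpy as np

def macro_prf1(y_true, y_pred, eps=1e-8):
    """Macro-averaged precision, recall, and F1 for multi-label predictions.

    y_true, y_pred: binary arrays of shape (n_samples, n_labels).
    """
    tp = (y_true * y_pred).sum(axis=0)          # true positives per label
    fp = ((1 - y_true) * y_pred).sum(axis=0)    # false positives per label
    fn = (y_true * (1 - y_pred)).sum(axis=0)    # false negatives per label
    prec = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    f1 = 2 * prec * rec / (prec + rec + eps)
    # average per-label scores, giving every label equal weight
    return prec.mean(), rec.mean(), f1.mean()
```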


Retrieval Performance

Retrieval was evaluated under two protocols:

Protocol                       P@5     mAP      MRR     Avg Time (ms)
Generalization (test → test)   0.776   0.0058   0.848   0.99
Historical (test → train)      0.794   0.0008   0.881   2.19
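Under both protocols, retrieval ranks gallery cases by cosine similarity between query and gallery embeddings in the shared joint space. A minimal numpy sketch (a brute-force search; the real system's index structure and embedding dimensions are assumptions):

```python
import numpy as np

def top_k(query, gallery, k=5):
    """Return indices and cosine similarities of the k nearest gallery cases."""
    # L2-normalise so the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    idx = np.argsort(-sims)[:k]   # descending similarity
    return idx, sims[idx]
```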

Retrieval Diversity

Metric                      Mean    Std. Dev   Median
Retrieval Diversity Score   0.217   0.041      0.222
Retrieval Overlap IoU@5     0.783   0.041      0.778

The model retrieves diverse and relevant cases, enabling multimodal explanation and case-based reasoning for clinical education.
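IoU@5 is the Jaccard overlap between two top-5 retrieval sets; note that the reported Diversity Score and IoU@5 sum to one, consistent with diversity being defined as 1 − IoU. A minimal sketch under that assumption:

```python
def overlap_iou(a, b):
    """Jaccard overlap between two top-k retrieval lists of case IDs."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def diversity_score(a, b):
    # Assumed definition: higher when the two lists share fewer cases.
    return 1.0 - overlap_iou(a, b)
```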


Notes

  • Retrieval and diversity metrics highlight the model’s ability to surface multiple relevant cases per query.
  • Lower performance on some rare labels may reflect dataset imbalance in Open-i.

Limitations & Risks

  • Trained on a single public dataset (Open-i); may not generalize to data from other hospitals

  • Explanations are not clinically validated

  • Not for diagnostic use in real-world settings


Acknowledgments

  • NIH Open-i Dataset

  • Swin Transformer (timm)

  • ClinicalBERT (Alsentzer et al.)

  • Captum (for IG explanations)

  • Grad-CAM

Code link: GitHub
