DermLIP: Dermatology Language-Image Pretraining
Model Description
DermLIP is a vision-language model for dermatology, trained on the Derm1M dataset, the largest dermatological image-text corpus to date. This model variant (PanDerm-base-w-PubMed-256) uses domain-specific pretrained encoders and delivers superior performance compared to the other DermLIP variants.
Model Details
Model Type: Pretrained Vision-Language Model (CLIP-style)
Architecture:
- Vision encoder (PanDerm-base): https://github.com/SiyuanYan1/PanDerm
- Text encoder (PubmedBert-256): https://huggingface.co/NeuML/pubmedbert-base-embeddings
Resolution: 224×224 pixels
Repository: https://github.com/SiyuanYan1/Derm1M
License: CC BY-NC-ND 4.0
Training Details
- Training data: 403,563 skin image-text pairs from the Derm1M dataset, including both dermoscopic and clinical images.
- Training objective: image-text contrastive loss (see the sketch after this list)
- Hardware: 1× NVIDIA H200 (~90 GB memory usage)
- Hours used: ~9.5 hours
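For reference, the image-text contrastive objective corresponds to the standard symmetric CLIP-style (InfoNCE) loss. The sketch below is illustrative only; the function and variable names and the temperature handling are assumptions, not the project's actual training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric image-text contrastive (InfoNCE) loss.

    Assumes both feature tensors are L2-normalized and share a batch dimension;
    logit_scale is the learned inverse temperature.
    """
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T
    # The i-th image matches the i-th text, so targets are the diagonal indices.
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    return (F.cross_entropy(logits_per_image, targets)
            + F.cross_entropy(logits_per_text, targets)) / 2
```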
Intended Uses
Primary Use Cases
- Zero-shot classification
- Few-shot learning
- Cross-modal retrieval (see the retrieval sketch after this list)
- Concept annotation/explanation
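As an illustration of the cross-modal retrieval use case, the following sketch ranks a set of candidate text descriptions against a query image. It loads the checkpoint the same way as the Quick Start below; the captions and the image path are hypothetical placeholders.

```python
import open_clip
import torch
from PIL import Image

# Load the checkpoint as in the Quick Start below.
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256'
)
model.eval()
tokenizer = open_clip.get_tokenizer('hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256')

# Hypothetical candidate captions and query image path.
captions = [
    "This is a skin image of melanoma",
    "This is a skin image of psoriasis",
    "This is a skin image of basal cell carcinoma",
]
image = preprocess(Image.open("query_skin_image.png")).unsqueeze(0)
text = tokenizer(captions)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Rank captions by cosine similarity to the query image (image-to-text retrieval).
similarity = (image_features @ text_features.T).squeeze(0)
for idx in similarity.argsort(descending=True):
    print(f"{similarity[idx].item():.3f}  {captions[idx]}")
```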
How to Use
Installation
First, clone the Derm1M repository:
git clone git@github.com:SiyuanYan1/Derm1M.git
cd Derm1M
Then install the package following the instructions in the repository.
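If you only need to run inference with this checkpoint through open_clip (as in the Quick Start below), installing the open_clip_torch package (for example, `pip install open_clip_torch`) should be sufficient; this is an assumption, so follow the repository's own instructions if they differ.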
Quick Start
import open_clip
from PIL import Image
import torch
# Load model with huggingface checkpoint
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256'
)
model.eval()
# Initialize tokenizer
tokenizer = open_clip.get_tokenizer('hf-hub:redlessone/DermLIP_PanDerm-base-w-PubMed-256')
# Read example image
image = preprocess(Image.open("your_skin_image.png")).unsqueeze(0)
# Define disease labels (example: PAD dataset classes)
PAD_CLASSNAMES = [
    "nevus",
    "basal cell carcinoma",
    "actinic keratosis",
    "seborrheic keratosis",
    "squamous cell carcinoma",
    "melanoma"
]
# Build text prompts
template = lambda c: f'This is a skin image of {c}'
text = tokenizer([template(c) for c in PAD_CLASSNAMES])
# Inference
with torch.no_grad(), torch.autocast("cuda"):
    # Encode image and text
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Compute similarity
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Get prediction
final_prediction = PAD_CLASSNAMES[torch.argmax(text_probs[0])]
print(f'This image is diagnosed as {final_prediction}.')
print("Label probabilities:", text_probs)
Contact
For any additional questions or comments, contact Siyuan Yan (siyuan.yan@monash.edu).
Cite our Paper
@misc{yan2025derm1m,
  title         = {Derm1M: A Million-Scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology},
  author        = {Siyuan Yan and Ming Hu and Yiwen Jiang and Xieji Li and Hao Fei and Philipp Tschandl and Harald Kittler and Zongyuan Ge},
  year          = {2025},
  eprint        = {2503.14911},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2503.14911}
}
@article{yan2025multimodal,
  title     = {A multimodal vision foundation model for clinical dermatology},
  author    = {Yan, Siyuan and Yu, Zhen and Primiero, Clare and Vico-Alonso, Cristina and Wang, Zhonghua and Yang, Litao and Tschandl, Philipp and Hu, Ming and Ju, Lie and Tan, Gin and others},
  journal   = {Nature Medicine},
  pages     = {1--12},
  year      = {2025},
  publisher = {Nature Publishing Group}
}