Model Card for CLIP-RoBERTa Fine-Tuned Model

This model is a fine-tuned vision-text dual encoder that pairs the CLIP vision encoder (openai/clip-vit-base-patch32) with a RoBERTa text encoder (roberta-base). It is fine-tuned for image-text matching, using mixed-precision training, gradient accumulation, and a cosine learning rate schedule with restarts.

The model was fine-tuned for the DermAi-Viz project.

Model Details

Model Description

This model card describes a Vision-Text Dual Encoder model fine-tuned from the original openai/clip-vit-base-patch32 and roberta-base models. The model is specifically adapted for tasks that involve joint processing of images and textual descriptions, leveraging both the image encoding capabilities of the CLIP model and the language understanding of RoBERTa.

Developed by: Dhruv Parthasarathy
Model type: Vision-Text Dual Encoder
Language(s) (NLP): English
License: Apache 2.0
Finetuned from model: openai/clip-vit-base-patch32 and roberta-base

Model Sources

Repository: GitHub

Uses

Direct Use

This model can be used directly for image-text matching tasks, such as retrieving relevant images for a textual query or ranking candidate captions for an image. As a dual encoder, it scores image-text pairs; it does not generate text.
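As a rough illustration of the retrieval use case, the sketch below ranks a small gallery of image embeddings against a text query embedding by cosine similarity. The function names and the toy 3-dimensional vectors are hypothetical; in practice the embeddings would come from the model's text and vision encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(text_embedding, image_embeddings):
    """Return image indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy embeddings for illustration only (not produced by the model).
query = [1.0, 0.0, 0.0]
gallery = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_images(query, gallery)
```

Here `ranking` lists the gallery indices in order of decreasing similarity to the query, so the first entry is the best match.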

Downstream Use

The model can be further fine-tuned for specific tasks such as image classification with text guidance, visual question answering, or any other task that benefits from multi-modal inputs.

Out-of-Scope Use

This model is not suitable for tasks unrelated to image and text processing. It may also perform poorly on non-English text or on images outside the scope of the fine-tuning data.

Bias, Risks, and Limitations

As with most large models, this model may inherit biases present in the pretraining datasets. Users should be cautious when deploying this model in sensitive applications, particularly where fairness and bias are of concern. Additionally, the model's performance might degrade on data that is significantly different from the training data, especially in terms of cultural, linguistic, or visual context.

Recommendations

Users should validate the model's performance on their specific use case, particularly looking for biases in the model's predictions. Additional fine-tuning with a carefully curated dataset may be necessary to mitigate biases. It's also recommended to monitor the model's outputs regularly to ensure its predictions remain relevant and unbiased.

How to Get Started with the Model

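from PIL import Image
from transformers import VisionTextDualEncoderModel, VisionTextDualEncoderProcessor

# Load the model and processor (replace the placeholder repository ID with the real one)
model = VisionTextDualEncoderModel.from_pretrained("your-username/clip-roberta-finetuned")
processor = VisionTextDualEncoderProcessor.from_pretrained("your-username/clip-roberta-finetuned")

# Open the image with PIL and pass the image object, not the file path, to the processor
image = Image.open("path_to_image.jpg").convert("RGB")

# Example usage
inputs = processor(text=["a photo of a cat"], images=[image], return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores, shape (num_images, num_texts)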

Training Details

Training Data

The model was fine-tuned on a custom dataset of paired image-text data covering diverse skin tones and a range of diseases. The dataset includes high-resolution images of dermatological conditions along with descriptive captions. During preprocessing, images were resized and normalized, and text inputs were tokenized.

Training Procedure

Fine-tuning was conducted in a GPU-accelerated environment with the following setup and hyperparameters:

Preprocessing

Images: Resized to the appropriate dimensions, normalized according to the standard mean and standard deviation used in CLIP models.
Text: Tokenized using the roberta-base tokenizer.
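The image normalization step can be sketched as follows, assuming the per-channel mean and standard deviation published with the OpenAI CLIP preprocessing. The helper name `normalize_pixel` is hypothetical, and in practice the processor applies this step (together with resizing) automatically.

```python
# Per-channel statistics used by the OpenAI CLIP image preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )

# A pixel equal to the channel means (on the 8-bit scale) normalizes to roughly zero.
mean_pixel = tuple(m * 255.0 for m in CLIP_MEAN)
normalized = normalize_pixel(mean_pixel)
```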

Training Hyperparameters

Training regime: Mixed precision (fp16) training to optimize memory and computation.
Batch size: 32 for both training and evaluation.
Learning rate: 3e-5, with a cosine learning rate schedule with restarts.
Weight decay: 0.01 to prevent overfitting.
Warmup steps: 1000 steps to stabilize the learning process.
Number of epochs: 1000 to ensure thorough training.
Gradient accumulation steps: 4 to effectively simulate a larger batch size.
Evaluation strategy: Performed at the end of each epoch.
Logging strategy: Metrics logged at the end of each epoch.
Checkpointing: The best model was saved based on evaluation loss, with a maximum of 3 checkpoints retained.
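To make two of these settings concrete, the sketch below computes the effective batch size implied by gradient accumulation and evaluates a cosine schedule with warmup and hard restarts. The learning rate, warmup steps, batch size, and accumulation steps come from the list above; the total step count, the number of restart cycles, and the single-device setup are assumptions chosen for illustration, since the card does not state them.

```python
import math

# Values taken from the hyperparameter list above.
BASE_LR = 3e-5
WARMUP_STEPS = 1000
PER_DEVICE_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 4

# Assumptions for illustration only; the card does not state these.
TOTAL_STEPS = 10_000
NUM_CYCLES = 2

# Samples contributing to each optimizer update on a single device.
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS

def learning_rate(step):
    """Linear warmup followed by cosine decay with hard restarts."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    if progress >= 1.0:
        return 0.0
    # Position within the current cosine cycle, in [0, 1).
    cycle_position = (NUM_CYCLES * progress) % 1.0
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * cycle_position))
```

With these assumed values, the rate climbs linearly to 3e-5 over the first 1000 steps, decays along a cosine curve, and jumps back to 3e-5 at the start of the second cycle.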

Speeds, Sizes, Times

Training time: [More Information Needed]
Total epochs: 1000
Checkpoint size: [More Information Needed]

Compute Resources

The fine-tuning process was conducted on a system with the following specifications:

GPU: NVIDIA V100
Memory Usage: Monitored with Weights & Biases, optimized through gradient accumulation and mixed precision.
Training duration: [More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

The evaluation was conducted on a held-out test set, which included images and captions not seen during training. The test set was designed to be diverse, representing various skin tones and dermatological conditions.

Metrics

The primary evaluation metric was eval_loss. Additionally, accuracy and recall metrics were used to assess the model's ability to correctly match images with their corresponding textual descriptions.
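The card does not spell out how accuracy and recall were computed. One common definition for image-text matching is Recall@k over an image-by-caption similarity matrix, with top-1 accuracy as the k = 1 case. The sketch below assumes that definition and assumes the correct caption for image i sits at index i; the function name and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of images whose matching caption is among the top-k scores.

    similarity[i][j] is the score between image i and caption j; the correct
    caption for image i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 score matrix: images 0 and 2 rank their own caption first,
# image 1 ranks its own caption second.
scores = [
    [0.9, 0.1, 0.0],
    [0.6, 0.5, 0.1],
    [0.2, 0.3, 0.8],
]
top1_accuracy = recall_at_k(scores, k=1)
recall_at_2 = recall_at_k(scores, k=2)
```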

Results

Best evaluation loss: [More Information Needed]
Top-1 accuracy: [More Information Needed]
Recall@k: [More Information Needed]

Summary

On the held-out test set, the model performed best when both the image and text inputs were well aligned with the training data. Performance may vary on data that differs significantly from the training set.

Environmental Impact

Carbon emissions were monitored using Weights & Biases, with the following estimates:

Hardware Type: NVIDIA V100 GPU
Training Hours: [More Information Needed]
Cloud Provider: [More Information Needed]
Compute Region: [More Information Needed]
Carbon Emitted: [More Information Needed]

Technical Specifications

Model Architecture and Objective

The model integrates CLIP's vision encoder (clip-vit-base-patch32) with RoBERTa's language model (roberta-base) in a dual encoder setup. The objective was to fine-tune this model to improve its performance on image-text matching tasks, specifically within the medical domain, focusing on dermatological images.
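The card does not name the training loss. Dual encoders of this kind are commonly trained with a CLIP-style symmetric contrastive loss, and the sketch below assumes that objective: it averages the cross-entropy computed over the rows (text to image) and the columns (image to text) of a similarity matrix in which matching pairs share an index. The function names and toy matrices are hypothetical.

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the index of the true match."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]

def contrastive_loss(similarity):
    """Average of text-to-image and image-to-text cross-entropy.

    similarity[i][j] scores text i against image j; matching pairs share an index.
    """
    n = len(similarity)
    text_loss = sum(cross_entropy(similarity[i], i) for i in range(n)) / n
    columns = [[similarity[i][j] for i in range(n)] for j in range(n)]
    image_loss = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (text_loss + image_loss) / 2.0

# Toy similarity matrices for illustration only.
aligned = [[10.0, 0.0], [0.0, 10.0]]   # matching pairs score highest
confused = [[0.0, 10.0], [10.0, 0.0]]  # mismatched pairs score highest
aligned_loss = contrastive_loss(aligned)
confused_loss = contrastive_loss(confused)
```

The loss is near zero when matching pairs dominate their row and column, and grows when mismatched pairs score higher.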

Compute Infrastructure

Hardware

GPUs: NVIDIA V100
Memory: 16GB GPU memory, utilized efficiently with mixed precision training.

Software

Transformers version: [More Information Needed]
PyTorch version: [More Information Needed]

Citation

BibTeX:

@misc{parthasarathy2024cliproberta,
  title={Fine-Tuned Vision-Text Dual Encoder Model for Image-Text Matching},
  author={Parthasarathy, Dhruv},
  year={2024},
  howpublished={\url{https://huggingface.co/your-username/clip-roberta-finetuned}},
}

Model Card Authors

This model card was prepared by Dhruv Parthasarathy.

Model Card Contact

For questions or issues with the model, please contact parthasarathy.d@northeastern.edu or linkedin.com/in/parthadhruv/.