# Model Card for Fine-Tuned Paligemma-3B-PT-224 Model
This model is a fine-tuned version of `google/paligemma-3b-pt-224`, adapted with the `peft` library on the `Multimodal-Fatima/VQAv2_sample_train` dataset for vision-language tasks.
## Model Details
### Model Description
This model is designed for vision-language tasks, fine-tuned to answer questions based on images and textual prompts. It combines parameter-efficient adaptation via `peft` with quantization to keep training and inference memory-efficient; a sketch of such a setup follows the list below.
- **Developed by:** [AmmarAbdelhady](https://ammar-abdelhady-ai.github.io/Ammar-Abdelhady-Portfolio/)
- **Model type:** Vision-Language Model
- **Language(s) (NLP):** English
- **Finetuned from model:** `google/paligemma-3b-pt-224`
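As a rough illustration of this kind of setup, here is a minimal sketch of loading the base checkpoint with 4-bit quantization and attaching LoRA adapters via `peft`. The hyperparameters (rank, target modules, compute dtype) are assumptions for illustration, not values taken from the fine-tuning notebook.

```python
# Minimal sketch: 4-bit quantized base model with LoRA adapters via peft.
# All hyperparameters below are illustrative assumptions.
import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=8,  # assumed adapter rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```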
### Model Sources
- **Repository:** [Vision-Language-Model-Fine-Tuning Notebook](https://github.com/Ammar-Abdelhady-ai/Vision-Language-Model-Fine-Tuning/blob/main/fine-tuning-of-paligemma-vision-language-model.ipynb)
- **Demo:** [Vision-Language-Model-Fine-Tuning](https://github.com/Ammar-Abdelhady-ai/Vision-Language-Model-Fine-Tuning)
## Uses
### Direct Use
This model can be used directly for vision-language tasks, including image captioning and visual question answering.
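For captioning, the base PaliGemma checkpoints use task prefixes such as `"caption en"`; whether this fine-tune (trained on VQA-style prompts) retains that behavior is an assumption worth verifying. A minimal sketch, with `model`, `processor`, and `raw_image` set up as in the example under *How to Get Started with the Model*:

```python
# Captioning via PaliGemma's task-prefix convention ("caption en").
# Assumes model, processor, and raw_image are loaded as in the
# getting-started example; the prefix behavior is inherited from the
# base checkpoint and may be altered by the VQA fine-tuning.
inputs = processor(text="caption en", images=raw_image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(caption)
```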
### Downstream Use
The model can be fine-tuned further for specific tasks or integrated into larger systems requiring vision-language capabilities.
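For example, further fine-tuning can start from the published weights. A minimal sketch, assuming the checkpoint is stored as a `peft` adapter and using the same `your_model_path` placeholder as the getting-started example:

```python
# Sketch: resume training from the published adapter, or merge it into the
# base weights for deployment. Assumes the checkpoint is a peft adapter.
from transformers import PaliGemmaForConditionalGeneration
from peft import PeftModel

base = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma-3b-pt-224")
model = PeftModel.from_pretrained(base, "your_model_path", is_trainable=True)

# ... continue training on your own data ...

merged = model.merge_and_unload()  # fold adapter weights into the base model
merged.save_pretrained("merged_model")
```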
### Out-of-Scope Use
The model is not suitable for tasks unrelated to vision-language processing, such as purely text-based or purely image-based tasks without multimodal interaction.
## Bias, Risks, and Limitations
The model may inherit biases from the training dataset, particularly in terms of visual and textual content. It is crucial to evaluate and mitigate these biases in downstream applications.
### Recommendations
Users should be aware of the model's limitations and potential biases. It is recommended to perform thorough evaluations on diverse datasets to understand the model's performance across different scenarios.
## How to Get Started with the Model
```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch
from PIL import Image
import requests

# Load the fine-tuned model and its processor; replace "your_model_path"
# with the checkpoint path or Hub repo id.
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "your_model_path", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = PaliGemmaProcessor.from_pretrained("your_model_path")

# Ask a question about an example image.
prompt = "What is on the flower?"
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
raw_image = Image.open(requests.get(image_url, stream=True).raw)

# Preprocess, generate, and strip the prompt tokens from the output.
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
answer = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
```
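If the repository stores only a `peft` adapter rather than merged weights, recent `transformers` releases with `peft` installed will attach the adapter to the base model automatically during `from_pretrained`; otherwise, load it explicitly as sketched under *Downstream Use*.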