---
{}
---
# Model Card for Fine-Tuned Paligemma-3B-PT-224 Model
This model is a fine-tuned version of `google/paligemma-3b-pt-224` using the `peft` library. The fine-tuning process involved the `Multimodal-Fatima/VQAv2_sample_train` dataset, focusing on vision-language tasks.
## Model Details
### Model Description
This model is designed for vision-language tasks, fine-tuned to answer questions based on images and textual prompts. It leverages advanced quantization techniques and specific configurations to optimize performance and efficiency.
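The "quantization techniques and specific configurations" mentioned above can be sketched as a typical 4-bit QLoRA-style setup with the `peft` and `bitsandbytes` libraries. The hyperparameters below (LoRA rank, target modules, quantization type) are illustrative assumptions, not the exact values used for this fine-tune; see the linked notebook for the actual configuration.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Hypothetical 4-bit quantization config (NF4 with float16 compute),
# a common choice when fine-tuning PaliGemma on a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Hypothetical LoRA adapter config; rank and target modules are assumptions
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

Passing `quantization_config=bnb_config` to `from_pretrained` and wrapping the model with this `LoraConfig` keeps the frozen base weights in 4-bit while training only the small adapter matrices.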
- **Developed by:** [AmmarAbdelhady](https://ammar-abdelhady-ai.github.io/Ammar-Abdelhady-Portfolio/)
- **Model type:** Vision-Language Model
- **Language(s) (NLP):** English
- **Finetuned from model:** `google/paligemma-3b-pt-224`
### Model Sources
- **Repository:** [Vision-Language-Model-Fine-Tuning Notebook](https://github.com/Ammar-Abdelhady-ai/Vision-Language-Model-Fine-Tuning/blob/main/fine-tuning-of-paligemma-vision-language-model.ipynb)
- **Demo:** [Vision-Language-Model-Fine-Tuning](https://github.com/Ammar-Abdelhady-ai/Vision-Language-Model-Fine-Tuning)
## Uses
### Direct Use
This model can be used directly for vision-language tasks, including image captioning and visual question answering.
### Downstream Use
The model can be fine-tuned further for specific tasks or integrated into larger systems requiring vision-language capabilities.
### Out-of-Scope Use
The model is not suitable for tasks unrelated to vision-language processing, such as purely text-based or purely image-based tasks without multimodal interaction.
## Bias, Risks, and Limitations
The model may inherit biases present in the VQAv2 training data, in both its images and its question-answer text. It is crucial to evaluate and mitigate these biases in downstream applications.
### Recommendations
Users should be aware of the model's limitations and potential biases. It is recommended to perform thorough evaluations on diverse datasets to understand the model's performance across different scenarios.
## How to Get Started with the Model
```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch
from PIL import Image
import requests

# Load the fine-tuned model and processor (replace 'your_model_path' with this repo's id)
model = PaliGemmaForConditionalGeneration.from_pretrained("your_model_path", torch_dtype=torch.float16)
processor = PaliGemmaProcessor.from_pretrained("your_model_path")

prompt = "What is on the flower?"
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
raw_image = Image.open(requests.get(image_url, stream=True).raw)

# Prepare the multimodal inputs; keyword arguments avoid mixing up text and images
inputs = processor(text=prompt, images=raw_image, return_tensors="pt")

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=20)

# Decode only the newly generated tokens, skipping the prompt portion of the sequence
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```