
Model Card for Fine-Tuned Paligemma-3B-PT-224 Model

This model is a fine-tuned version of google/paligemma-3b-pt-224, trained with the peft library on the Multimodal-Fatima/VQAv2_sample_train dataset for vision-language tasks.

Model Details

Model Description

This model is designed for vision-language tasks and was fine-tuned to answer questions about images given textual prompts. It uses quantization and parameter-efficient fine-tuning to reduce memory requirements and improve training efficiency.
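The exact quantization settings used during fine-tuning are not recorded in this card. A typical 4-bit setup with transformers' BitsAndBytesConfig, shown here purely as an illustrative sketch, might look like:

```python
import torch
from transformers import BitsAndBytesConfig

# Hypothetical 4-bit quantization config; these values are illustrative,
# not the settings actually used to fine-tune this model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit form
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bfloat16
)

# Passed to from_pretrained via `quantization_config=bnb_config`
# (requires the bitsandbytes package and a CUDA device at load time).
```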

  • Developed by: AmmarAbdelhady
  • Model type: Vision-Language Model
  • Language(s) (NLP): English
  • Finetuned from model: google/paligemma-3b-pt-224

Model Sources

Uses

Direct Use

This model can be used directly for vision-language tasks, including image captioning and visual question answering.

Downstream Use

The model can be fine-tuned further for specific tasks or integrated into larger systems requiring vision-language capabilities.
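Since the model was trained with peft, further fine-tuning would typically attach a fresh LoRA adapter to the base model. The rank, alpha, and target modules below are illustrative assumptions, not the values used for this model:

```python
from peft import LoraConfig, get_peft_model

# Hypothetical LoRA settings for further fine-tuning; adjust rank and
# target modules to your task and memory budget.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

# base_model = PaliGemmaForConditionalGeneration.from_pretrained(...)
# peft_model = get_peft_model(base_model, lora_config)
```

Only the adapter weights are updated during such a run, which keeps the trainable parameter count small relative to the 3B base model.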

Out-of-Scope Use

The model is not suitable for tasks unrelated to vision-language processing, such as purely text-based or purely image-based tasks without multimodal interaction.

Bias, Risks, and Limitations

The model may inherit biases from the training dataset, particularly in terms of visual and textual content. It is crucial to evaluate and mitigate these biases in downstream applications.

Recommendations

Users should be aware of the model's limitations and potential biases. It is recommended to perform thorough evaluations on diverse datasets to understand the model's performance across different scenarios.

How to Get Started with the Model

from PIL import Image
import requests
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

# Load the fine-tuned model and its processor
# (replace 'your_model_path' with this repo id or a local path)
model = PaliGemmaForConditionalGeneration.from_pretrained("your_model_path")
processor = PaliGemmaProcessor.from_pretrained("your_model_path")

# Download an example image and pose a question about it
prompt = "What is on the flower?"
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
raw_image = Image.open(requests.get(image_url, stream=True).raw)

# Prepare multimodal inputs and generate an answer
inputs = processor(text=prompt, images=raw_image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)

# The generated sequence echoes the prompt tokens; skip them before decoding
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))