# Model Card for Fine-Tuned Paligemma-3B-PT-224 Model

This model is a fine-tuned version of `google/paligemma-3b-pt-224` using the `peft` library. The fine-tuning process involved the `Multimodal-Fatima/VQAv2_sample_train` dataset, focusing on vision-language tasks.

## Model Details

### Model Description

This model is designed for vision-language tasks, fine-tuned to answer questions based on images and textual prompts. It leverages advanced quantization techniques and specific configurations to optimize performance and efficiency.

- **Developed by:** [AmmarAbdelhady](https://ammar-abdelhady-ai.github.io/Ammar-Abdelhady-Portfolio/)
- **Model type:** Vision-Language Model
- **Language(s) (NLP):** English
- **Finetuned from model:** `google/paligemma-3b-pt-224`

### Model Sources

- **Repository:** [Vision-Language-Model-Fine-Tuning Notebook](https://github.com/Ammar-Abdelhady-ai/Vision-Language-Model-Fine-Tuning/blob/main/fine-tuning-of-paligemma-vision-language-model.ipynb)
- **Demo:** [Vision-Language-Model-Fine-Tuning](https://github.com/Ammar-Abdelhady-ai/Vision-Language-Model-Fine-Tuning)

## Uses

### Direct Use

This model can be used directly for vision-language tasks, including image captioning and visual question answering.

### Downstream Use

The model can be fine-tuned further for specific tasks or integrated into larger systems requiring vision-language capabilities.

### Out-of-Scope Use

The model is not suitable for tasks unrelated to vision-language processing, such as purely text-based or purely image-based tasks without multimodal interaction.

## Bias, Risks, and Limitations

The model may inherit biases from the training dataset, particularly in terms of visual and textual content. It is crucial to evaluate and mitigate these biases in downstream applications.

### Recommendations

Users should be aware of the model's limitations and potential biases.
Evaluate the model thoroughly on diverse datasets to understand how it performs across different scenarios.

## How to Get Started with the Model

```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch
from PIL import Image
import requests

# Replace 'your_model_path' with the local path or Hub ID of the fine-tuned model.
model = PaliGemmaForConditionalGeneration.from_pretrained('your_model_path')
processor = PaliGemmaProcessor.from_pretrained('your_model_path')

prompt = "What is on the flower?"
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
# Download the example image and make sure it is in RGB mode.
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")

# Pass text and image by keyword so the call does not depend on argument order.
inputs = processor(text=prompt, images=raw_image, return_tensors="pt")

# Generate without tracking gradients, then decode the output tokens.
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(output[0], skip_special_tokens=True))
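
# Follow-up sketch (an addition, not part of the original card): `generate` returns
# the prompt tokens followed by the new tokens, so the decoded text above is assumed
# to start with the prompt. `strip_prompt` is a hypothetical helper that removes that
# echo; if the prompt is not found at the start, the text is returned unchanged.
def strip_prompt(decoded: str, prompt: str) -> str:
    """Return only the text that follows the echoed prompt."""
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    return decoded.strip()

# Usage with the variables defined above:
# answer = strip_prompt(processor.decode(output[0], skip_special_tokens=True), prompt)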