metadata
license: mit
language:
- en
library_name: transformers
inference: false
pipeline_tag: image-text-to-text
Sharded BLIP-2 Model Card - flan-t5-xl
This is a sharded version of blip2-flan-t5-xl, which uses Flan-T5-XL as its language model for image-to-text tasks such as image captioning and visual question answering.
- This model repo is sharded so that it can be loaded easily on low-RAM Colab runtimes :)
- Refer to the original model card for more details about the model description, intended uses, and limitations, as well as instructions for how to use the model on CPU and GPU in different precisions.
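To make the idea of sharding concrete, here is a minimal pure-Python sketch of how a checkpoint can be split into size-limited shards with an index that maps each parameter to its file. It is an illustration only, not the transformers implementation: the function names, parameter names, and byte sizes are hypothetical, and the file-name pattern is only loosely modeled on the one used for sharded PyTorch checkpoints.

```python
# Illustrative sketch only: toy parameter sizes, hypothetical helper names.

def shard_by_size(param_sizes, max_shard_bytes):
    """Greedily group parameters into shards of at most max_shard_bytes each."""
    shards, current, current_size = [], {}, 0
    for name, size in param_sizes.items():
        # Start a new shard when adding this parameter would exceed the limit.
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = {}, 0
        current[name] = size
        current_size += size
    if current:
        shards.append(current)
    return shards


def build_index(shards):
    """Build an index that records which shard file holds each parameter."""
    total = len(shards)
    weight_map = {}
    for i, shard in enumerate(shards, start=1):
        filename = f"pytorch_model-{i:05d}-of-{total:05d}.bin"
        for name in shard:
            weight_map[name] = filename
    total_size = sum(sum(shard.values()) for shard in shards)
    return {"metadata": {"total_size": total_size}, "weight_map": weight_map}


# Toy parameter sizes in bytes (made up for the example).
sizes = {
    "encoder.w": 400,
    "encoder.b": 100,
    "decoder.w": 450,
    "decoder.b": 50,
    "lm_head.w": 300,
}
shards = shard_by_size(sizes, max_shard_bytes=600)
index = build_index(shards)
print(len(shards), index["weight_map"]["lm_head.w"])
```

Because each shard can be read and released one at a time, peak RAM during loading is bounded roughly by the model itself plus one shard, rather than the model plus a full second copy of the checkpoint.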
Usage
Refer to the original model card for details or see this blog post. Here is how you can use it on CPU:
Install
Requires the current main branch of transformers (at time of writing):
pip install accelerate git+https://github.com/huggingface/transformers.git -U -q
Use (this is for CPU; check out the original model card/blog for fp16 and int8 usage)
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the BLIP-2 processor and the sharded model (full precision, on CPU)
model_name = "ethzanalytics/blip2-flan-t5-xl-sharded"
processor = Blip2Processor.from_pretrained(model_name)
model = Blip2ForConditionalGeneration.from_pretrained(model_name)

# Download the demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Ask a question about the image and decode the generated answer
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
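For reference, a hedged sketch of what half-precision (fp16) GPU usage might look like. It assumes a CUDA-capable GPU and that torch and accelerate are installed; the helper name `caption_fp16` is hypothetical, and the original model card remains the authoritative source for fp16 and int8 instructions.

```python
def caption_fp16(image, question, model_name="ethzanalytics/blip2-flan-t5-xl-sharded"):
    """Answer a question about a PIL image with the model loaded in fp16.

    Sketch only: assumes a CUDA GPU plus the torch and accelerate packages.
    """
    # Imported lazily so that defining this helper does not require a GPU setup.
    import torch
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    processor = Blip2Processor.from_pretrained(model_name)
    # torch_dtype loads the weights in half precision; device_map="auto"
    # lets accelerate place them on the available device(s).
    model = Blip2ForConditionalGeneration.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )

    # Move the inputs to the model's device; only floating-point tensors
    # (the pixel values) are cast to fp16, token ids stay integer.
    inputs = processor(image, question, return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs)
    return processor.decode(out[0], skip_special_tokens=True).strip()
```

With `raw_image` and `question` defined as in the CPU example, this could be called as `print(caption_fp16(raw_image, question))`.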