|
--- |
|
license: openrail |
|
inference: false |
|
pipeline_tag: image-to-text |
|
tags: |
|
- image-to-text |
|
- visual-question-answering |
|
- image-captioning |
|
datasets: |
|
- coco |
|
- textvqa |
|
- VQAv2 |
|
- OK-VQA |
|
- A-OKVQA |
|
language: |
|
- en |
|
|
|
--- |
|
This is the repo for the paper [PromptCap: Prompt-Guided Task-Aware Image Captioning](https://arxiv.org/abs/2211.09699). This paper is accepted to ICCV 2023 as [PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3](https://openaccess.thecvf.com/content/ICCV2023/html/Hu_PromptCap_Prompt-Guided_Image_Captioning_for_VQA_with_GPT-3_ICCV_2023_paper.html). |
|
|
|
|
|
We introduce PromptCap, a captioning model that can be controlled by natural-language instructions. The instruction may contain a question the user is interested in, for example, "what is the boy putting on?". PromptCap also supports generic captioning via the prompt "what does the image describe?".
|
|
|
PromptCap can serve as a lightweight visual plug-in (much faster than BLIP-2) for LLMs such as GPT-3 and ChatGPT, as well as other foundation models such as Segment Anything and DINO.

It achieves SOTA performance on COCO captioning (150 CIDEr).

When paired with GPT-3 and conditioned on the user question, PromptCap achieves SOTA performance on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).
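
As a rough sketch of this caption-then-LLM setup (the chat model name and prompt wording below are illustrative assumptions, not the exact in-context format used in the paper; it assumes the `openai` Python package and an `OPENAI_API_KEY` environment variable):

```python
import os

import torch
from openai import OpenAI
from promptcap import PromptCap

# 1) Get a question-aware caption from PromptCap.
captioner = PromptCap("tifa-benchmark/promptcap-coco-vqa")
if torch.cuda.is_available():
    captioner.cuda()

question = "what piece of clothing is this boy putting on?"
caption = captioner.caption(
    f"please describe this image according to the given question: {question}",
    "glove_boy.jpeg",
)

# 2) Let an LLM answer from the caption (any chat model can be used here).
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Image caption: {caption}\nQuestion: {question}\nAnswer with a short phrase.",
    }],
)
print(response.choices[0].message.content)
```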
|
|
|
# QuickStart |
|
|
|
## Installation |
|
``` |
|
pip install promptcap |
|
``` |
|
|
|
Two pipelines are included. One is for image captioning, and the other is for visual question answering. |
|
|
|
## Captioning Pipeline |
|
|
|
Please follow the prompt format shown below, which gives the best performance.
|
|
|
Generate a prompt-guided caption as follows:
|
```python |
|
import torch |
|
from promptcap import PromptCap |
|
|
|
model = PromptCap("tifa-benchmark/promptcap-coco-vqa")  # also supports OFA checkpoints, e.g. "OFA-Sys/ofa-large"
|
|
|
if torch.cuda.is_available(): |
|
model.cuda() |
|
|
|
prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?" |
|
image = "glove_boy.jpeg" |
|
|
|
print(model.caption(prompt, image)) |
|
``` |
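
If you construct prompts from user questions programmatically, a small helper like the one below (purely illustrative; `build_prompt` is not part of the `promptcap` package) keeps the expected format consistent:

```python
def build_prompt(question: str) -> str:
    # Wrap a user question in the prompt template expected by PromptCap.
    return f"please describe this image according to the given question: {question}"

# Reuses the `model` loaded above.
print(model.caption(build_prompt("what piece of clothing is this boy putting on?"), "glove_boy.jpeg"))
```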
|
|
|
To try generic captioning, just use "what does the image describe?" |
|
|
|
```python |
|
prompt = "what does the image describe?" |
|
image = "glove_boy.jpeg" |
|
|
|
print(model.caption(prompt, image)) |
|
``` |
|
|
|
|
|
|
|
PromptCap also supports taking OCR inputs:
|
|
|
```python |
|
prompt = "please describe this image according to the given question: what year was this taken?" |
|
image = "dvds.jpg" |
|
ocr = "yip AE Mht juor 02/14/2012" |
|
|
|
print(model.caption(prompt, image, ocr)) |
|
``` |
|
|
|
|
|
|
|
## Visual Question Answering Pipeline |
|
|
|
Unlike typical VQA models, which perform classification over a fixed answer set (e.g., on VQAv2), PromptCap is open-domain and can be paired with arbitrary text-QA models.

Here we provide a pipeline that combines PromptCap with UnifiedQA.
|
|
|
```python |
|
import torch |
|
from promptcap import PromptCap_VQA |
|
|
|
# the QA model supports all UnifiedQA variants, e.g. "allenai/unifiedqa-v2-t5-large-1251000"
|
vqa_model = PromptCap_VQA(promptcap_model="tifa-benchmark/promptcap-coco-vqa", qa_model="allenai/unifiedqa-t5-base") |
|
|
|
if torch.cuda.is_available(): |
|
vqa_model.cuda() |
|
|
|
question = "what piece of clothing is this boy putting on?" |
|
image = "glove_boy.jpeg" |
|
|
|
print(vqa_model.vqa(question, image)) |
|
``` |
|
|
|
Similarly, PromptCap supports OCR inputs:
|
|
|
```python |
|
question = "what year was this taken?" |
|
image = "dvds.jpg" |
|
ocr = "yip AE Mht juor 02/14/2012" |
|
|
|
print(vqa_model.vqa(question, image, ocr=ocr)) |
|
``` |
|
|
|
Thanks to the flexibility of UnifiedQA, PromptCap also supports multiple-choice VQA:
|
|
|
```python |
|
question = "what piece of clothing is this boy putting on?" |
|
image = "glove_boy.jpeg" |
|
choices = ["gloves", "socks", "shoes", "coats"] |
|
print(vqa_model.vqa_multiple_choice(question, image, choices)) |
|
``` |
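
Because PromptCap itself is just a captioner, you can also pair it with a text-QA model of your own choosing instead of the bundled `PromptCap_VQA` pipeline. A minimal sketch using a Hugging Face `text2text-generation` pipeline (the "question \n context" separator follows UnifiedQA's input convention; adapt it for other QA models):

```python
from promptcap import PromptCap
from transformers import pipeline

# Caption the image with a question-aware prompt, then answer from the caption text.
captioner = PromptCap("tifa-benchmark/promptcap-coco-vqa")
qa = pipeline("text2text-generation", model="allenai/unifiedqa-t5-base")

question = "what piece of clothing is this boy putting on?"
caption = captioner.caption(
    f"please describe this image according to the given question: {question}",
    "glove_boy.jpeg",
)

# UnifiedQA expects a literal "\n" between the question and the context.
answer = qa(f"{question} \\n {caption}")[0]["generated_text"]
print(answer)
```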
|
|
|
## Bibtex |
|
``` |
|
@article{hu2022promptcap, |
|
title={PromptCap: Prompt-Guided Task-Aware Image Captioning}, |
|
author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo}, |
|
journal={arXiv preprint arXiv:2211.09699}, |
|
year={2022} |
|
} |
|
``` |