---
license: openrail
inference: false
pipeline_tag: image-to-text
tags:
- image-to-text
- visual-question-answering
- image-captioning
datasets:
- coco
- textvqa
- VQAv2
- OK-VQA
- A-OKVQA
language:
- en
---
This is the repo for the paper [PromptCap: Prompt-Guided Task-Aware Image Captioning](https://arxiv.org/abs/2211.09699). The paper was accepted to ICCV 2023 as [PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3](https://openaccess.thecvf.com/content/ICCV2023/html/Hu_PromptCap_Prompt-Guided_Image_Captioning_for_VQA_with_GPT-3_ICCV_2023_paper.html).
We introduce PromptCap, a captioning model that can be controlled by a natural-language instruction. The instruction may contain a question that the user is interested in,
for example, "what is the boy putting on?". PromptCap also supports generic captioning via the question "what does the image describe?"
PromptCap can serve as a lightweight visual plug-in (much faster than BLIP-2) for LLMs like GPT-3 and ChatGPT, and for other foundation models like Segment Anything and DINO (a minimal sketch of the LLM pattern appears at the end of the QuickStart).
It achieves SOTA performance on COCO captioning (150 CIDEr).
When paired with GPT-3 and conditioned on the user's question, PromptCap achieves SOTA performance on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).
# QuickStart
## Installation
```
pip install promptcap
```
Two pipelines are included. One is for image captioning, and the other is for visual question answering.
## Captioning Pipeline
Please follow the prompt format below, which gives the best performance (a small helper for building such prompts is sketched at the end of this section).
Generate a prompt-guided caption as follows:
```python
import torch
from promptcap import PromptCap
model = PromptCap("tifa-benchmark/promptcap-coco-vqa")  # also supports OFA checkpoints, e.g. "OFA-Sys/ofa-large"
if torch.cuda.is_available():
    model.cuda()
prompt = "please describe this image according to the given question: what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
print(model.caption(prompt, image))
```
To try generic captioning, just use the prompt "what does the image describe?":
```python
prompt = "what does the image describe?"
image = "glove_boy.jpeg"
print(model.caption(prompt, image))
```
PromptCap also supports OCR inputs:
```python
prompt = "please describe this image according to the given question: what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"
print(model.caption(prompt, image, ocr))
```
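If you build prompts programmatically, a small helper like the one below can keep them in the expected format. The `build_prompt` function is a hypothetical convenience, not part of the promptcap package; it reuses the `model` defined above.
```python
from typing import Optional

def build_prompt(question: Optional[str] = None) -> str:
    # Hypothetical helper (not part of the promptcap package):
    # wraps a question in the prompt format shown above.
    if question is None:
        return "what does the image describe?"  # generic captioning
    return f"please describe this image according to the given question: {question}"

print(model.caption(build_prompt("what piece of clothing is this boy putting on?"), "glove_boy.jpeg"))
```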
## Visual Question Answering Pipeline
Unlike typical VQA models, which treat VQA as classification over a fixed answer vocabulary (e.g., on VQAv2), PromptCap is open-domain and can be paired with arbitrary text QA models.
Here we provide a pipeline that combines PromptCap with UnifiedQA.
```python
import torch
from promptcap import PromptCap_VQA
# the QA model supports all UnifiedQA variants, e.g. "allenai/unifiedqa-v2-t5-large-1251000"
vqa_model = PromptCap_VQA(promptcap_model="tifa-benchmark/promptcap-coco-vqa", qa_model="allenai/unifiedqa-t5-base")
if torch.cuda.is_available():
    vqa_model.cuda()
question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
print(vqa_model.vqa(question, image))
```
Similarly, PromptCap supports OCR inputs:
```python
question = "what year was this taken?"
image = "dvds.jpg"
ocr = "yip AE Mht juor 02/14/2012"
print(vqa_model.vqa(question, image, ocr=ocr))
```
Because of the flexibility of UnifiedQA, PromptCap also supports multiple-choice VQA:
```python
question = "what piece of clothing is this boy putting on?"
image = "glove_boy.jpeg"
choices = ["gloves", "socks", "shoes", "coats"]
print(vqa_model.vqa_multiple_choice(question, image, choices))
```
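As noted above, PromptCap can also serve as a visual plug-in for LLMs like GPT-3 in knowledge-based VQA. Below is a minimal sketch of that pattern, assuming the `openai` Python package; the prompt wording and the model name are illustrative assumptions, not the paper's exact setup.
```python
from openai import OpenAI
from promptcap import PromptCap

model = PromptCap("tifa-benchmark/promptcap-coco-vqa")
client = OpenAI()  # requires OPENAI_API_KEY in the environment

question = "what piece of clothing is this boy putting on?"
caption = model.caption(
    f"please describe this image according to the given question: {question}",
    "glove_boy.jpeg",
)

# Feed the question-aware caption to the LLM as textual context for answering.
# The prompt template and model name here are assumptions for illustration.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Image description: {caption}\nQuestion: {question}\nAnswer briefly:",
    }],
)
print(response.choices[0].message.content)
```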
## Bibtex
```
@article{hu2022promptcap,
title={PromptCap: Prompt-Guided Task-Aware Image Captioning},
author={Hu, Yushi and Hua, Hang and Yang, Zhengyuan and Shi, Weijia and Smith, Noah A and Luo, Jiebo},
journal={arXiv preprint arXiv:2211.09699},
year={2022}
}
``` |