zayed committed on
Commit 755e8ae · verified · 1 Parent(s): f23d2e6

Upload 5 files

Files changed (5)
  1. README.md +151 -0
  2. special_tokens_map.json +7 -0
  3. tokenizer.json +0 -0
  4. tokenizer_config.json +21 -0
  5. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,151 @@
---
pipeline_tag: image-to-text
tags:
- image-captioning
languages:
- en
license: bsd-3-clause
---

This is the Salesforce BLIP large image captioning model with small adjustments to the parameters on the back end for testing - note in particular that the length of the reply is increased.
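
The exact values that were changed are not listed in this card, so the snippet below is only a rough sketch of how a longer reply can be requested at call time: standard `generate` arguments such as `max_new_tokens` (and optionally `num_beams`) control caption length, and the numbers shown here are illustrative, not the ones used by this repo.

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt")

# Illustrative generation settings: a higher max_new_tokens allows a longer caption,
# and beam search tends to produce a more complete sentence.
out = model.generate(**inputs, max_new_tokens=60, num_beams=5)
print(processor.decode(out[0], skip_special_tokens=True))
```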

# BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Model card for image captioning pretrained on the COCO dataset - base architecture (with ViT large backbone).

| ![BLIP.gif](https://cdn-uploads.huggingface.co/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif) |
|:--:|
| <b> Pull figure from BLIP official repo | Image source: https://github.com/salesforce/BLIP </b>|

## TL;DR

Authors from the [paper](https://arxiv.org/abs/2201.12086) write in the abstract:

*Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released.*

## Usage

You can use this model for conditional and unconditional image captioning.

### Using the PyTorch model

#### Running the model on CPU

<details>
<summary> Click to expand </summary>

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>

#### Running the model on GPU

##### In full precision

<details>
<summary> Click to expand </summary>

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>

##### In half precision (`float16`)

<details>
<summary> Click to expand </summary>

```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog
```
</details>

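For quick experiments, the same checkpoint can also be driven through the `image-to-text` pipeline declared in the metadata above. This is a minimal sketch, assuming a transformers version that ships that pipeline; the repo id can be swapped for this one to pick up its adjusted settings.

```python
from transformers import pipeline

# The Salesforce checkpoint is used here, as in the examples above.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

result = captioner("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg")
print(result[0]["generated_text"])
```
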
## BibTeX and citation info

```bibtex
@misc{https://doi.org/10.48550/arxiv.2201.12086,
  doi = {10.48550/ARXIV.2201.12086},
  url = {https://arxiv.org/abs/2201.12086},
  author = {Li, Junnan and Li, Dongxu and Xiong, Caiming and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
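
As a quick sanity check - a minimal sketch, assuming the files in this commit are loaded through `transformers` - the special tokens declared above surface on the processor's tokenizer:

```python
from transformers import BlipProcessor

# Loading the Salesforce checkpoint as in the README examples; this repo declares the same tokens.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
tok = processor.tokenizer

# These attributes are populated from special_tokens_map.json
print(tok.cls_token, tok.sep_token, tok.pad_token, tok.mask_token, tok.unk_token)
# [CLS] [SEP] [PAD] [MASK] [UNK]
```
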
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,21 @@
{
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "name_or_path": "Salesforce/blip-image-captioning-large",
  "never_split": null,
  "pad_token": "[PAD]",
  "processor_class": "BlipProcessor",
  "sep_token": "[SEP]",
  "special_tokens_map_file": null,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]",
  "model_input_names": [
    "input_ids",
    "attention_mask"
  ]
}
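
A minimal sketch of what this configuration implies in practice (assuming the tokenizer is loaded with `AutoTokenizer`): the BERT-style tokenizer lowercases input, caps sequences at 512 tokens, and returns only the `input_ids` and `attention_mask` listed in `model_input_names`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/blip-image-captioning-large")

enc = tokenizer("A Photography Of A Dog", return_tensors="pt")
print(sorted(enc.keys()))          # ['attention_mask', 'input_ids'] per model_input_names
print(tokenizer.model_max_length)  # 512

# do_lower_case is true, so the prompt is lowercased before WordPiece tokenization
print(tokenizer.tokenize("A Photography Of A Dog"))
```
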
vocab.txt ADDED
The diff for this file is too large to render. See raw diff