bluelike committed
Commit 6010982
1 Parent(s): b6241d7

Update README.md

Files changed (1):
  1. README.md +246 -67

README.md CHANGED
#### Key Enhancements:

* **Enhanced Image Comprehension**: We've significantly improved the model's ability to understand and interpret visual information, setting new benchmarks across key performance metrics.

* **Advanced Video Understanding**: Qwen2-VL now features superior online streaming capabilities, enabling real-time analysis of dynamic video content with remarkable accuracy.

* **Integrated Visual Agent Functionality**: Our model now seamlessly incorporates sophisticated system integration, transforming Qwen2-VL into a powerful visual agent capable of complex reasoning and decision-making.

* **Expanded Multilingual Support**: We've broadened our language capabilities to better serve a diverse global user base, making Qwen2-VL more accessible and effective across different linguistic contexts.

#### Model Architecture Updates:
Here we show a code snippet to show you how to use the chat model with `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
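If you want to continue the dialog after the first reply, you can append the generated answer as an assistant turn followed by a new user question, then repeat the same preparation steps. This is a minimal sketch of our own (not part of the original snippet), assuming the chat template and `process_vision_info` accept the appended assistant turn in the same content-list format; the follow-up question is illustrative:

```python
# Append the model's reply and a follow-up question, then rebuild the prompt and generate again.
messages.append(
    {"role": "assistant", "content": [{"type": "text", "text": output_text[0]}]}
)
messages.append(
    {"role": "user", "content": [{"type": "text", "text": "What colors stand out the most?"}]}
)

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)
follow_up = processor.batch_decode(
    [out[len(inp) :] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)
print(follow_up)
```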
<details>
<summary>Without qwen_vl_utils</summary>

```python
from PIL import Image
import requests
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
```
</details>
<details>
<summary>Multi image inference</summary>

```python
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</details>
<details>
<summary>Video inference</summary>

```python
# Messages containing an image list as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
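The image-list form above expects pre-extracted frames. If you start from a raw video file, you can sample the frames yourself; the sketch below uses OpenCV purely as an illustrative helper (an assumption on our side, not a dependency of this example) to grab roughly one frame per second and build the corresponding message:

```python
import os

import cv2  # assumed helper dependency; any frame-extraction tool works


def sample_frames(video_path, out_dir, fps=1.0):
    """Sample frames from a local video at roughly `fps` frames per second."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if FPS is unknown
    step = max(int(round(native_fps / fps)), 1)
    frame_paths, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            path = os.path.join(out_dir, f"frame{len(frame_paths) + 1}.jpg")
            cv2.imwrite(path, frame)
            frame_paths.append(f"file://{os.path.abspath(path)}")
        idx += 1
    cap.release()
    return frame_paths


frame_paths = sample_frames("/path/to/video1.mp4", "/tmp/frames", fps=1.0)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": frame_paths, "fps": 1.0},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
```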
</details>

<details>
<summary>Batch inference</summary>

```python
# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
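One practical note from us (not stated in the snippet above): batched generation with a decoder-only model usually works best with left padding, so that every prompt ends immediately before the generated tokens. If batched outputs look degraded, a common fix is:

```python
# Pad on the left so generated tokens directly follow each prompt in the batch.
processor.tokenizer.padding_side = "left"
```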
</details>

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.

```python
# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64 encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```
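To build the base64 form from a local file, the standard library is sufficient. A small sketch of our own, reusing the `data:image;base64,` prefix shown above:

```python
import base64

# Read a local image and wrap it in the data URI form used in the example above.
with open("/path/to/your/image.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"data:image;base64,{b64_image}"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```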
#### Image Resolution for performance boost

The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.

```python
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```
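As a rule of thumb, the `28 * 28` factor suggests roughly one visual token per 28x28-pixel patch after resizing, which is how the 256-1280 token range above maps to these pixel budgets. A quick sketch of that estimate (an approximation on our part, not an exact formula from the model card):

```python
def approx_visual_tokens(num_pixels: int) -> int:
    """Rough visual-token estimate, assuming ~28*28 pixels per token."""
    return num_pixels // (28 * 28)

print(approx_visual_tokens(256 * 28 * 28))   # 256  -> lower end of the range above
print(approx_visual_tokens(1280 * 28 * 28))  # 1280 -> upper end of the range above
```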

In addition, we provide two methods for fine-grained control over the image size input to the model: specifying `resized_height` and `resized_width` directly, or setting `min_pixels` and `max_pixels` per image.

```python
# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```
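For reference, under the same rough 28x28-pixels-per-token estimate used above, `min_pixels = max_pixels = 50176` (50176 = 64 * 28 * 28) pins each image to about 64 visual tokens, while `resized_height=280, resized_width=420` (280 * 420 = 150 * 28 * 28) corresponds to about 150 tokens. Treat these as estimates rather than guarantees.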

## Limitations

While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:

1. Lack of Audio Support: The current model does **not comprehend audio information** within videos.
2. Data Timeliness: Our image dataset is **updated until June 2023**, so information after this date may not be covered.
3. Constraints on Individuals and Intellectual Property (IP): The model's capacity to recognize specific individuals or IPs is limited, and it may not comprehensively cover all well-known personalities or brands.
4. Limited Capacity for Complex Instructions: When faced with intricate multi-step instructions, the model's understanding and execution capabilities require enhancement.
5. Insufficient Counting Accuracy: Particularly in complex scenes, the accuracy of object counting is not high and needs further improvement.
6. Weak Spatial Reasoning Skills: Especially in 3D spaces, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.

These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.

## Citation

If you find our work helpful, feel free to cite us.

```
@article{Qwen2-VL,
  title={Qwen2-VL},
  author={Qwen team},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```