Update README.md
README.md
CHANGED
@@ -1,5 +1,7 @@
 ---
-license:
+license: other
+license_name: qwen
+license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
 pipeline_tag: image-text-to-text
 library_name: transformers
 base_model:
@@ -65,7 +67,7 @@ To construct this dataset, we propose an efficient data construction pipeline. S

 - **For samples with clear ground truths:**
 the model is prompted to first provide the reasoning process and then give the final answer in the format like `Final Answer: ***`.
-Responses matching the ground truth answer constitute the positive set \\(mathcal{Y}_p\\), while those that do not match make up the negative set \\(\mathcal{Y}_n\\). Additionally, responses that fail to provide a clear final answer are also merged into \\(\mathcal{Y}_n\\).
+Responses matching the ground truth answer constitute the positive set \\(\mathcal{Y}_p\\), while those that do not match make up the negative set \\(\mathcal{Y}_n\\). Additionally, responses that fail to provide a clear final answer are also merged into \\(\mathcal{Y}_n\\).
 Given these responses labeled as positive or negative, we build the preference pairs by selecting a chosen response \\(y_c\\) from \\(\mathcal{Y}_p\\) and a negative response \\(y_r\\) from \\(\mathcal{Y}_n\\).

 - **For samples without clear ground truths:**
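The construction described in this hunk can be illustrated with a short, self-contained sketch. This is not code from the MPO data pipeline; the sampled `responses`, the `ground_truth` value, and the helper `extract_final_answer` are hypothetical stand-ins:

```python
import re

# Hypothetical sampled responses for one training sample with a clear ground truth.
responses = [
    "The tower has 4 floors with 12 windows each. Final Answer: 48",
    "Counting roughly gives about 50 windows. Final Answer: 50",
    "The image shows a tall tower with many windows.",  # no clear final answer
]
ground_truth = "48"

def extract_final_answer(text):
    # Return the text after "Final Answer:" if present, otherwise None.
    match = re.search(r"Final Answer:\s*(.+)", text)
    return match.group(1).strip() if match else None

positive_set, negative_set = [], []  # Y_p and Y_n
for response in responses:
    answer = extract_final_answer(response)
    # Responses without a clear final answer are merged into the negative set.
    if answer is not None and answer == ground_truth:
        positive_set.append(response)
    else:
        negative_set.append(response)

# Preference pairs: a chosen response y_c from Y_p and a rejected response y_r from Y_n.
preference_pairs = [(y_c, y_r) for y_c in positive_set for y_r in negative_set]
print(len(preference_pairs))  # 1 positive x 2 negatives = 2 pairs in this toy example
```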
@@ -160,7 +162,7 @@ To comprehensively compare InternVL's performance before and after MPO, we emplo

 ## Quick Start

-We provide an example code to run `InternVL2_5-1B` using `transformers`.
+We provide an example code to run `InternVL2_5-78B-MPO` using `transformers`.

 > Please use transformers>=4.37.2 to ensure the model works normally.

@@ -171,7 +173,7 @@ We provide an example code to run `InternVL2_5-1B` using `transformers`.
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModel
-path = "OpenGVLab/InternVL2_5-
+path = "OpenGVLab/InternVL2_5-78B-MPO"
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
@@ -185,7 +187,7 @@ model = AutoModel.from_pretrained(
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModel
-path = "OpenGVLab/InternVL2_5-
+path = "OpenGVLab/InternVL2_5-78B-MPO"
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
@@ -230,8 +232,8 @@ def split_model(model_name):

     return device_map

-path = "OpenGVLab/InternVL2_5-
-device_map = split_model('InternVL2_5-
+path = "OpenGVLab/InternVL2_5-78B-MPO"
+device_map = split_model('InternVL2_5-78B')
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
@@ -244,6 +246,7 @@ model = AutoModel.from_pretrained(
 ### Inference with Transformers

 ```python
+import math
 import numpy as np
 import torch
 import torchvision.transforms as T
@@ -326,14 +329,44 @@ def load_image(image_file, input_size=448, max_num=12):
     pixel_values = torch.stack(pixel_values)
     return pixel_values

-
-
+def split_model(model_name):
+    device_map = {}
+    world_size = torch.cuda.device_count()
+    num_layers = {
+        'InternVL2_5-1B': 24, 'InternVL2_5-2B': 24, 'InternVL2_5-4B': 36, 'InternVL2_5-8B': 32,
+        'InternVL2_5-26B': 48, 'InternVL2_5-38B': 64, 'InternVL2_5-78B': 80}[model_name]
+    # Since the first GPU will be used for ViT, treat it as half a GPU.
+    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
+    num_layers_per_gpu = [num_layers_per_gpu] * world_size
+    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
+    layer_cnt = 0
+    for i, num_layer in enumerate(num_layers_per_gpu):
+        for j in range(num_layer):
+            device_map[f'language_model.model.layers.{layer_cnt}'] = i
+            layer_cnt += 1
+    device_map['vision_model'] = 0
+    device_map['mlp1'] = 0
+    device_map['language_model.model.tok_embeddings'] = 0
+    device_map['language_model.model.embed_tokens'] = 0
+    device_map['language_model.output'] = 0
+    device_map['language_model.model.norm'] = 0
+    device_map['language_model.lm_head'] = 0
+    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
+
+    return device_map
+
+# If you set `load_in_8bit=True`, you will need two 80GB GPUs.
+# If you set `load_in_8bit=False`, you will need at least three 80GB GPUs.
+path = 'OpenGVLab/InternVL2_5-78B-MPO'
+device_map = split_model('InternVL2_5-78B')
 model = AutoModel.from_pretrained(
     path,
     torch_dtype=torch.bfloat16,
+    load_in_8bit=False,
     low_cpu_mem_usage=True,
     use_flash_attn=True,
-    trust_remote_code=True
+    trust_remote_code=True,
+    device_map=device_map).eval()
 tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

 # set the max number of tiles in `max_num`
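As a quick sanity check of the `split_model` balancing added in this hunk, the arithmetic for an assumed 4-GPU node and the 80-layer InternVL2_5-78B language model works out as follows (an illustrative sketch, not part of the model card):

```python
import math

# Assumed setup: 4 GPUs, InternVL2_5-78B language model with 80 layers.
world_size, num_layers = 4, 80

# The first GPU also hosts the ViT and embeddings, so it counts as half a GPU.
num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))  # ceil(80 / 3.5) = 23
plan = [num_layers_per_gpu] * world_size
plan[0] = math.ceil(plan[0] * 0.5)  # GPU 0 keeps ceil(23 * 0.5) = 12 layers

print(plan)       # [12, 23, 23, 23]
print(sum(plan))  # 81 slots for 80 layers; the final layer is pinned back to GPU 0 anyway
```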
@@ -510,9 +543,9 @@ LMDeploy abstracts the complex inference process of multi-modal Vision-Language
 from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image

-model = 'OpenGVLab/InternVL2_5-
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))
 response = pipe(('describe this image', image))
 print(response.text)
 ```
@@ -528,8 +561,8 @@ from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
 from lmdeploy.vl.constants import IMAGE_TOKEN

-model = 'OpenGVLab/InternVL2_5-
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

 image_urls=[
     'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
@@ -550,8 +583,8 @@ Conducting inference with batch prompts is quite straightforward; just place the
 from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image

-model = 'OpenGVLab/InternVL2_5-
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

 image_urls=[
     "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
@@ -570,8 +603,8 @@ There are two ways to do the multi-turn conversations with the pipeline. One is
 from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
 from lmdeploy.vl import load_image

-model = 'OpenGVLab/InternVL2_5-
-pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
+model = 'OpenGVLab/InternVL2_5-78B-MPO'
+pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, tp=4))

 image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
 gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
@@ -586,7 +619,7 @@ print(sess.response.text)
 LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

 ```shell
-lmdeploy serve api_server OpenGVLab/InternVL2_5-
+lmdeploy serve api_server OpenGVLab/InternVL2_5-78B-MPO --server-port 23333 --tp 4
 ```

 To use the OpenAI-style interface, you need to install OpenAI:
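The install command and client snippet that follow this context line are outside the hunks shown here. For reference, a minimal OpenAI-style call against the server started by the command above might look like the sketch below; it assumes the `openai` Python package and the `--server-port 23333` value from this hunk, and it is an illustration rather than the exact snippet from the model card:

```python
from openai import OpenAI

# Assumes `lmdeploy serve api_server ... --server-port 23333 --tp 4` is already running.
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
    temperature=0.8,
    top_p=0.8)
print(response.choices[0].message.content)
```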
@@ -625,7 +658,7 @@ print(response)

 ## License

-This project is released under the MIT License. This project uses the pre-trained Qwen2.5-
+This project is released under the MIT License. This project uses the pre-trained Qwen2.5-72B-Instruct as a component, which is licensed under the Qwen License.

 ## Citation
