Ubuntu committed
Commit ba40b0b • 1 Parent(s): e8b84fe

english

- README.md +2 -4
- README_en.md +100 -0
README.md
CHANGED
    @@ -15,7 +15,7 @@ tags:
     inference: false
     ---
     
    -#
    +# GLM-4V-9B
     
     GLM-4V-9B 是智谱 AI 推出的最新一代预训练模型 GLM-4 系列中的开源多模态版本。
     **GLM-4V-9B** 具备 1120 * 1120 高分辨率下的中英双语多轮对话能力,在中英文综合能力、感知推理、文字识别、图表理解等多方面多模态评测中,GLM-4V-9B 表现出超越 GPT-4-turbo-2024-04-09、Gemini
    @@ -73,12 +73,10 @@ with torch.no_grad():
         print(tokenizer.decode(outputs[0]))
     ```
     
    -## 协议
    +## 协议
     
     GLM-4 模型的权重的使用则需要遵循 [LICENSE](LICENSE)。
     
    -The use of the GLM-4 model weights needs to comply with the [LICENSE](LICENSE).
    -
     ## 引用
     
     如果你觉得我们的工作有帮助的话,请考虑引用下列论文。
README_en.md
ADDED
@@ -0,0 +1,100 @@
    +# GLM-4V-9B
    +
    +GLM-4V-9B is an open source multimodal version of the latest generation of pre-trained models in the GLM-4 series launched by Zhipu AI.
    +**GLM-4V-9B** has the ability to conduct multi-round conversations in Chinese and English at a high resolution of 1120 * 1120. In multimodal evaluations of comprehensive Chinese and English abilities, perceptual reasoning, text recognition, and chart understanding, GLM-4V-9B has shown superior performance over GPT-4-turbo-2024-04-09, Gemini
    +1.0 Pro, Qwen-VL-Max, and Claude 3 Opus.
    +
    +### Multimodal
    +
    +GLM-4V-9B is a multimodal language model with visual understanding capabilities. The evaluation results of its related classic tasks are as follows:
    +
    +
    +| | **MMBench-EN-Test** | **MMBench-CN-Test** | **SEEDBench_IMG** | **MMStar** | **MMMU** | **MME** | **HallusionBench** | **AI2D** | **OCRBench** |
    +|-------------------------|---------------------|---------------------|-------------------|------------|----------|---------|--------------------|----------|--------------|
    +| | English overall | Chinese overall | Overall ability | Overall ability | Multi-discipline | Perception and reasoning | Hallucination | Chart understanding | Text recognition |
    +| **GPT-4o, 20240513** | 83.4 | 82.1 | 77.1 | 63.9 | 69.2 | 2310.3 | 55 | 84.6 | 736 |
    +| **GPT-4v, 20240409** | 81 | 80.2 | 73 | 56 | 61.7 | 2070.2 | 43.9 | 78.6 | 656 |
    +| **GPT-4v, 20231106** | 77 | 74.4 | 72.3 | 49.7 | 53.8 | 1771.5 | 46.5 | 75.9 | 516 |
    +| **InternVL-Chat-V1.5** | 82.3 | 80.7 | 75.2 | 57.1 | 46.8 | 2189.6 | 47.4 | 80.6 | 720 |
    +| **LlaVA-Next-Yi-34B** | 81.1 | 79 | 75.7 | 51.6 | 48.8 | 2050.2 | 34.8 | 78.9 | 574 |
    +| **Step-1V** | 80.7 | 79.9 | 70.3 | 50 | 49.9 | 2206.4 | 48.4 | 79.2 | 625 |
    +| **MiniCPM-Llama3-V2.5** | 77.6 | 73.8 | 72.3 | 51.8 | 45.8 | 2024.6 | 42.4 | 78.4 | 725 |
    +| **Qwen-VL-Max** | 77.6 | 75.7 | 72.7 | 49.5 | 52 | 2281.7 | 41.2 | 75.7 | 684 |
    +| **GeminiProVision** | 73.6 | 74.3 | 70.7 | 38.6 | 49 | 2148.9 | 45.7 | 72.9 | 680 |
    +| **Claude-3V Opus** | 63.3 | 59.2 | 64 | 45.7 | 54.9 | 1586.8 | 37.8 | 70.6 | 694 |
    +| **GLM-4v-9B** | 81.1 | 79.4 | 76.8 | 58.7 | 47.2 | 2163.8 | 46.6 | 81.1 | 786 |
    +
    +
    +**This repository is the model repository of GLM-4V-9B, supporting `8K` context length.**
    +
    +## Quick Start
    +
    +```python
    +
    +import torch
    +from PIL import Image
    +from transformers import AutoModelForCausalLM, AutoTokenizer
    +
    +device = "cuda"
    +
    +tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)
    +
    +query = 'describe this image'
    +image = Image.open("your image").convert('RGB')
    +inputs = tokenizer.apply_chat_template([{"role": "user", "image": image, "content": query}],
    +                                       add_generation_prompt=True, tokenize=True, return_tensors="pt",
    +                                       return_dict=True)  # chat mode
    +
    +inputs = inputs.to(device)
    +model = AutoModelForCausalLM.from_pretrained(
    +    "THUDM/glm-4v-9b",
    +    torch_dtype=torch.bfloat16,
    +    low_cpu_mem_usage=True,
    +    trust_remote_code=True
    +).to(device).eval()
    +
    +gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}
    +with torch.no_grad():
    +    outputs = model.generate(**inputs, **gen_kwargs)
    +    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    +    print(tokenizer.decode(outputs[0]))
    +```
    +
    +## License
    +
    +The use of the GLM-4 model weights needs to comply with the [LICENSE](LICENSE).
    +
    +## Citation
    +
    +If you find our work helpful, please consider citing the following papers.
    +
    +```
    +@article{zeng2022glm,
    +  title={Glm-130b: An open bilingual pre-trained model},
    +  author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
    +  journal={arXiv preprint arXiv:2210.02414},
    +  year={2022}
    +}
    +```
    +
    +```
    +@inproceedings{du2022glm,
    +  title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
    +  author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
    +  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    +  pages={320--335},
    +  year={2022}
    +}
    +```
    +
    +```
    +@misc{wang2023cogvlm,
    +  title={CogVLM: Visual Expert for Pretrained Language Models},
    +  author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
    +  year={2023},
    +  eprint={2311.03079},
    +  archivePrefix={arXiv},
    +  primaryClass={cs.CV}
    +}
    +```
    +
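
A note on two details of the Quick Start snippet added in README_en.md: `gen_kwargs` combines `do_sample=True` with `top_k=1`, and the generated tensor is sliced at `inputs['input_ids'].shape[1]` before decoding. Assuming the usual behaviour of top-k sampling (only the k highest-scoring tokens remain candidates) and of decoder-only generation (the returned sequence begins with the prompt tokens), the sketch below illustrates both points in plain Python. The helper names `top_k_sample` and `strip_prompt` are hypothetical and are not part of `transformers` or the GLM-4V code.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one index from the k highest-scoring logits (softmax-weighted)."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the survivors, shifted by the max for numerical stability.
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def strip_prompt(sequences, prompt_len):
    """Drop the first prompt_len tokens from every generated sequence."""
    return [seq[prompt_len:] for seq in sequences]


# With k=1 only one candidate survives, so sampling always picks the argmax.
logits = [0.1, 2.5, -1.0, 2.4]
rng = random.Random(0)
picks = {top_k_sample(logits, k=1, rng=rng) for _ in range(100)}
print(picks)

# Hypothetical token ids: three prompt tokens followed by three new tokens.
generated = [[101, 102, 103, 7, 8, 9]]
print(strip_prompt(generated, prompt_len=3))
```

Under these assumptions, `top_k=1` makes sampling behave like greedy decoding, and the slice keeps only the newly generated tokens so the decoded text does not repeat the prompt.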