MiniMax-AI committed on
Commit
cfde609
·
1 Parent(s): 305d273

Initial Commit

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +1 -0
  2. LICENSE +42 -0
  3. README.md +158 -3
  4. added_tokens.json +28 -0
  5. apple.jpg +0 -0
  6. chat_template.json +3 -0
  7. config.json +261 -0
  8. configuration_minimax_text_01.py +152 -0
  9. configuration_minimax_vl_01.py +127 -0
  10. figures/MiniMaxLogo.png +0 -0
  11. figures/TextBench.png +0 -0
  12. figures/VisionBench.png +0 -0
  13. figures/hailuo.svg +1 -0
  14. figures/image.jpg +0 -0
  15. figures/minimax.svg +1 -0
  16. figures/niah.png +3 -0
  17. image_processor.py +616 -0
  18. main.py +124 -0
  19. merges.txt +0 -0
  20. model-00000-of-00414.safetensors +3 -0
  21. model-00001-of-00414.safetensors +3 -0
  22. model-00002-of-00414.safetensors +3 -0
  23. model-00003-of-00414.safetensors +3 -0
  24. model-00004-of-00414.safetensors +3 -0
  25. model-00005-of-00414.safetensors +3 -0
  26. model-00006-of-00414.safetensors +3 -0
  27. model-00007-of-00414.safetensors +3 -0
  28. model-00008-of-00414.safetensors +3 -0
  29. model-00009-of-00414.safetensors +3 -0
  30. model-00010-of-00414.safetensors +3 -0
  31. model-00011-of-00414.safetensors +3 -0
  32. model-00012-of-00414.safetensors +3 -0
  33. model-00013-of-00414.safetensors +3 -0
  34. model-00014-of-00414.safetensors +3 -0
  35. model-00015-of-00414.safetensors +3 -0
  36. model-00016-of-00414.safetensors +3 -0
  37. model-00017-of-00414.safetensors +3 -0
  38. model-00018-of-00414.safetensors +3 -0
  39. model-00019-of-00414.safetensors +3 -0
  40. model-00020-of-00414.safetensors +3 -0
  41. model-00021-of-00414.safetensors +3 -0
  42. model-00022-of-00414.safetensors +3 -0
  43. model-00023-of-00414.safetensors +3 -0
  44. model-00024-of-00414.safetensors +3 -0
  45. model-00025-of-00414.safetensors +3 -0
  46. model-00026-of-00414.safetensors +3 -0
  47. model-00027-of-00414.safetensors +3 -0
  48. model-00028-of-00414.safetensors +3 -0
  49. model-00029-of-00414.safetensors +3 -0
  50. model-00030-of-00414.safetensors +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ figures/niah.png filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,42 @@
1
+
2
+ MINIMAX MODEL LICENSE AGREEMENT
3
+
4
+ 1. Definitions
5
+ "Agreement" means the terms and conditions for use, reproduction, distribution and modification of the MiniMax Model Materials set forth herein.
6
+ "Licensee" or "you" means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering into this Agreement on their behalf.
7
+ "MiniMax Model" means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by MiniMax at https://huggingface.co/MiniMaxAI/MiniMaxText01, https://huggingface.co/MiniMaxAI/MiniMaxVL01, https://github.com/MiniMax-AI/MiniMax01. In this Agreement, the MiniMax Model includes MiniMaxText01 and MiniMaxVL01.
8
+ "MiniMax Model Materials" means, collectively, MiniMax’s proprietary MiniMax Model and Documentation (and any portion thereof) made available under this Agreement.
9
+ "MiniMax" or "we" means MiniMax AI.
10
+
11
+ 2. License Rights and Redistribution
12
+ a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under MiniMax’s intellectual property or other rights owned by MiniMax embodied in the MiniMax Model Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the MiniMax Model Materials.
13
+ b. Redistribution and Use.
14
+ i. If you distribute or make available the MiniMax Model Materials (or any derivative works thereof), or a product or service that uses any of them, including another AI model, you shall (A) provide a copy of this Agreement with any such MiniMax Model Materials; and (B) prominently display “Built with MiniMax AI” on a related website, user interface, blogpost, about page, or product documentation. If you use the MiniMax Model Materials to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “MiniMax” at the beginning of any such AI model name.
15
+ ii. You must retain in all copies of the MiniMax Model Materials that you distribute the following attribution notice within a “Notice” text file distributed as a part of such copies: “MiniMax AI model is licensed under the MiniMax License, Copyright © MiniMax. All Rights Reserved.”
16
+ iii. Your use of the MiniMax Model Materials must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Prohibited Uses Policy for the MiniMax Model Materials, which is hereby incorporated by reference into this Agreement.
17
+ iv. You will not use the MiniMax Model Materials or any output or results of the MiniMax Model Materials to improve any other large language model.
18
+
19
+ 3. Additional Commercial Terms. If, on the MiniMax Model Materials release date, the monthly active users of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 100 million monthly active users in the preceding calendar month, you must request a license from MiniMax, which MiniMax may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until MiniMax otherwise expressly grants you such rights.
20
+
21
+ 4. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE MINIMAX MODEL MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, AND MINIMAX DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE MINIMAX MODEL MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE MINIMAX MODEL MATERIALS AND ANY OUTPUT AND RESULTS.
22
+
23
+ 5. Limitation of Liability. IN NO EVENT WILL MINIMAX OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF MINIMAX OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
24
+
25
+ 6. Intellectual Property.
26
+ a. No trademark licenses are granted under this Agreement, and in connection with the MiniMax Model Materials, neither MiniMax nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the MiniMax Materials or as set forth in this Section 6(a). MiniMax hereby grants you a license to use "MiniMaxText01" or "MiniMaxVL01" (the "Mark") solely as required to comply with the last sentence of Section 2.b.i. All goodwill arising out of your use of the Mark will inure to the benefit of MiniMax.
27
+ b. Subject to MiniMax’s ownership of MiniMax Model Materials and derivatives made by or for MiniMax, with respect to any derivative works and modifications of the MiniMax Model Materials that are made by you, as between you and MiniMax, you are and will be the owner of such derivative works and modifications.
28
+ c. If you institute litigation or other proceedings against MiniMax or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the MiniMax Model Materials or outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless MiniMax from and against any claim by any third party arising out of or related to your use or distribution of the MiniMax Model Materials.
29
+ 7. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the MiniMax Model Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. MiniMax may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the MiniMax Model Materials. Sections 2, 3 and 6 shall survive the termination of this Agreement.
30
+
31
+ 8. Governing Law and Jurisdiction. This agreement will be governed and construed under the laws of Singapore without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this agreement. Any dispute arising out of or in connection with this Agreement, including any question regarding its existence, validity or termination, shall be referred to and finally resolved by arbitration administered by the Singapore International Arbitration Centre (“SIAC”) in accordance with the Arbitration Rules of the Singapore International Arbitration Centre (“SIAC Rules”) for the time being in force, which rules are deemed to be incorporated by reference in this clause.
32
+
33
+ You agree you will not use, or allow others to use, MiniMaxText01 or MiniMaxVL01 to:
34
+ 1. Violate any applicable federal, state, local, or international law or regulation, or infringe upon the lawful rights or interests of any third party.
35
+ 2. Assist with, engage in or in any way associate with any military purpose.
36
+ 3. Exploit, harm, or attempt to exploit or harm minors in any way.
37
+ 4. Generate or disseminate false or misleading information with the intent to harm others.
38
+ 5. Generate or disseminate content prohibited by applicable laws or regulations.
39
+ 6. Generate or disseminate personally identifiable information without proper authorization or for unreasonable or unlawful purposes.
40
+ 7. Defame, disparage, harass, or cause harm to any individual or entity.
41
+ 8. Carry out fully automated decision-making that adversely affects an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation.
42
+ 9. Promote discrimination, hate speech, or harmful behavior towards individuals or groups based on race or ethnic origin, religion, disability, age, nationality and national origin, veteran status, sexual orientation, gender or gender identity, caste, immigration status, or any other legally protected characteristics or categories.
README.md CHANGED
@@ -1,3 +1,158 @@
1
- ---
2
- license: unknown
3
- ---
1
+ <div align="center">
2
+ <img src="figures/MiniMaxLogo.png" width="60%" alt="MiniMax-Text-01" />
3
+ </div>
4
+ <hr>
5
+
6
+ <div align="center" style="line-height: 1;">
7
+ <a href="https://www.minimaxi.com/en" target="_blank" style="margin: 2px;">
8
+ <img alt="Homepage" src="https://img.shields.io/badge/_Homepage-MiniMax-FF4040?style=flat-square&labelColor=2C3E50&logo=&logoWidth=20" style="display: inline-block; vertical-align: middle;"/>
9
+ </a>
10
+ <a href="https://huggingface.co/MiniMaxAI" target="_blank" style="margin: 2px;">
11
+ <img alt="Hugging Face" src="https://img.shields.io/badge/🤗_Hugging_Face-MinMax-FF4040?style=flat-square&labelColor=2C3E50" style="display: inline-block; vertical-align: middle;"/>
12
+ </a>
13
+ </div>
14
+ <div align="center" style="line-height: 1;">
15
+ <a href="https://www.hailuo.ai/" target="_blank" style="margin: 2px;">
16
+ <img alt="Chat" src="https://img.shields.io/badge/Chat-_Hailuo AI-FF4040?style=flat-square&labelColor=2C3E50&logo=&logoWidth=16" style="display: inline-block; vertical-align: middle;"/>
17
+ </a>
18
+ <a href="https://intl.minimaxi.com" style="margin: 2px;">
19
+ <img alt="API" src="https://img.shields.io/badge/⚡_API-Platform-FF4040?style=flat-square&labelColor=2C3E50" style="display: inline-block; vertical-align: middle;"/>
20
+ </a>
21
+ </div>
22
+ <div align="center" style="line-height: 1;">
23
+ <a href="https://github.com/MiniMax-AI/MiniMax-01/blob/main/LICENSE" style="margin: 2px;">
24
+ <img alt="License" src="https://img.shields.io/badge/📜_License-Model_Agreement-FF4040?style=flat-square&labelColor=2C3E50" style="display: inline-block; vertical-align: middle;"/>
25
+ </a>
26
+ </div>
27
+
28
+ # MiniMax-VL-01
29
+
30
+ ## 1. Introduction
31
+ We are delighted to introduce our **MiniMax-VL-01** model. It adopts the “ViT-MLP-LLM” framework, a common architecture for multimodal large language models. The model is built from three key parts: a 303-million-parameter Vision Transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and MiniMax-Text-01 as the base LLM.
32
+ MiniMax-VL-01 has a notable dynamic resolution feature. Input images are resized per a pre-set grid, with resolutions from 336×336 to 2016×2016, keeping a 336×336 thumbnail. The resized images are split into non-overlapping patches of the same size. These patches and the thumbnail are encoded separately and then combined for a full image representation.
33
+ The training data for MiniMax-VL-01 consists of caption, description, and instruction data. The Vision Transformer (ViT) is trained on 694 million image-caption pairs from scratch. Across four distinct stages of the training pipeline, a total of 512 billion tokens are processed, leveraging this vast amount of data to endow the model with strong capabilities.
34
+ Finally, MiniMax-VL-01 has reached top-level performance on multimodal leaderboards, demonstrating its edge and dependability in complex multimodal tasks.
35
+
36
+
37
+ <p align="center">
38
+ <img width="100%" src="figures/VisionBench.png">
39
+ </p>
40
+
41
+
42
+ ## 2. Evaluation
43
+
44
+ | Tasks | GPT-4o (11-20) | Claude-3.5-Sonnet (10-22) | Gemini-1.5-Pro (002) | Gemini-2.0-Flash (exp) | Qwen2-VL-72B-Inst. | InternVL2.5-78B | Llama-3.2-90B | MiniMax-VL-01 |
45
+ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
46
+ | **Knowledge** | | | | | | | | |
47
+ | MMMU<sup>*</sup> | 63.5 | **72.0** | 68.4 | 70.6 | 64.5 | 66.5 | 62.1 | 68.5 |
48
+ | MMMU-Pro<sup>*</sup> | 54.5 | 54.7 | 50.9 | **57.0** | 43.2 | 47.3 | 36.0 | 52.7 |
49
+ | **Visual Q&A** | | | | | | | | |
50
+ | ChartQA<sup>*</sup><sub>relaxed</sub> | 88.1 | 90.8 | 88.7 | 88.3 | 91.2 | 91.5 | 85.5 | **91.7** |
51
+ | DocVQA<sup>*</sup> | 91.1 | 94.2 | 91.5 | 92.9 | **97.1** | 96.1 | 90.1 | 96.4 |
52
+ | OCRBench | 806 | 790 | 800 | 846 | 856 | 847 | 805 | **865** |
53
+ | **Mathematics & Sciences** | | | | | | | | |
54
+ | AI2D<sup>*</sup> | 83.1 | 82.0 | 80.9 | 85.1 | 84.4 | **86.8** | 78.9 | 83.3 |
55
+ | MathVista<sup>*</sup> | 62.1 | 65.4 | 70.6 | **73.1** | 69.6 | 68.4 | 57.3 | 68.6 |
56
+ | OlympiadBench<sub>full</sub> | 25.2 | 28.4 | 32.1 | **46.1** | 21.9 | 25.1 | 19.3 | 24.2 |
57
+ | **Long Context** | | | | | | | | |
58
+ |M-LongDoc<sub>acc</sub>| **41.4** | 31.4 | 26.2 | 31.4 | 11.6 | 19.7 | 13.9 | 32.5 |
59
+ | **Comprehensive** | | | | | | | | |
60
+ |MEGA-Bench<sub>macro</sub> | 49.4 | 51.4 | 45.9 | **53.9** | 46.8 | 45.3 | 19.9 | 47.4 |
61
+ | **User Experience** | | | | | | | | |
62
+ |In-house Benchmark | 62.3 | 47.0 | 49.2 | **72.1** | 40.6 | 34.8 | 13.6 | 56.6 |
63
+
64
+ <sup>*</sup> Evaluated following a _0-shot CoT_ setting.
65
+
66
+
67
+ ## 3. Quickstart
68
+ Here we provide a simple example of loading the processor and model to generate content.
+ ```python
+ from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig, QuantoConfig, GenerationConfig
+ import torch
+ import json
+ import os
+ from PIL import Image
+
+ # load hf config
+ hf_config = AutoConfig.from_pretrained("MiniMax-VL-01", trust_remote_code=True)
+
+ # quantization config, int8 is recommended
+ quantization_config = QuantoConfig(
+     weights="int8",
+     modules_to_not_convert=[
+         "vision_tower",
+         "image_newline",
+         "multi_modal_projector",
+         "lm_head",
+         "embed_tokens",
+     ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.text_config.num_hidden_layers)]
+     + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.text_config.num_hidden_layers)]
+ )
+
+ # set device map
+ model_safetensors_index_path = os.path.join("MiniMax-VL-01", "model.safetensors.index.json")
+ with open(model_safetensors_index_path, "r") as f:
+     model_safetensors_index = json.load(f)
+ weight_map = model_safetensors_index['weight_map']
+ vision_map = {}
+ for key, value in weight_map.items():
+     if 'vision_tower' in key or 'image_newline' in key or 'multi_modal_projector' in key:
+         new_key = key.replace('.weight','').replace('.bias','')
+         if new_key not in vision_map:
+             vision_map[new_key] = value
+ # assume 8 GPUs
+ world_size = 8
+ device_map = {
+     'language_model.model.embed_tokens': 'cuda:0',
+     'language_model.model.norm': f'cuda:{world_size - 1}',
+     'language_model.lm_head': f'cuda:{world_size - 1}'
+ }
+ for key, value in vision_map.items():
+     device_map[key] = f'cuda:0'
+ device_map['vision_tower.vision_model.post_layernorm'] = f'cuda:0'
+ layers_per_device = hf_config.text_config.num_hidden_layers // world_size
+ for i in range(world_size):
+     for j in range(layers_per_device):
+         device_map[f'language_model.model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'
+
+ # load processor
+ processor = AutoProcessor.from_pretrained("MiniMax-VL-01", trust_remote_code=True)
+ messages = [
+     {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-VL-01 model."}]},
+     {"role": "user", "content": [{"type": "image", "image": "placeholder"},{"type": "text", "text": "Describe this image."}]},
+ ]
+ prompt = processor.tokenizer.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ raw_image = Image.open("figures/image.jpg")
+ # tokenize and move to device
+ model_inputs = processor(images=[raw_image], text=prompt, return_tensors='pt').to('cuda').to(torch.bfloat16)
+
+ # load bfloat16 model, move to device, and apply quantization
+ quantized_model = AutoModelForCausalLM.from_pretrained(
+     "MiniMax-VL-01",
+     torch_dtype="bfloat16",
+     device_map=device_map,
+     quantization_config=quantization_config,
+     trust_remote_code=True,
+     offload_buffers=True,
+ )
+ generation_config = GenerationConfig(
+     max_new_tokens=100,
+     eos_token_id=200020,
+     use_cache=True,
+ )
+
+ # generate response
+ generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
+ print(f"generated_ids: {generated_ids}")
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+ response = processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ ```
154
+
155
+ ## 4. Chatbot & API
156
+ For general use and evaluation, we provide a [Chatbot](https://www.hailuo.ai/) with online search capabilities and the [online API](https://intl.minimaxi.com) for developers.
157
+
158
+ Contact us at [model@minimaxi.com](mailto:model@minimaxi.com).
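Reviewer note on the Quickstart device map: the README places the embeddings and vision modules on the first GPU, the final norm and `lm_head` on the last GPU, and splits the decoder layers evenly across devices. The sketch below restates only that layer placement as a pure function so it can be checked without loading any weights. `build_device_map` is a hypothetical helper name, not part of the repository, and the sketch leaves out the per-key vision entries that the README reads from `model.safetensors.index.json`.

```python
def build_device_map(num_layers, world_size):
    """Mirror the quickstart's layer placement (hypothetical helper, not in the repo)."""
    if num_layers % world_size != 0:
        raise ValueError("this sketch assumes num_layers is divisible by world_size")
    last = f"cuda:{world_size - 1}"
    device_map = {
        "language_model.model.embed_tokens": "cuda:0",
        "language_model.model.norm": last,
        "language_model.lm_head": last,
        "vision_tower.vision_model.post_layernorm": "cuda:0",
    }
    layers_per_device = num_layers // world_size
    for layer in range(num_layers):
        # Consecutive blocks of layers go to consecutive GPUs.
        device_map[f"language_model.model.layers.{layer}"] = f"cuda:{layer // layers_per_device}"
    return device_map


# 80 decoder layers (text_config.num_hidden_layers) over the 8 GPUs the README assumes.
device_map = build_device_map(num_layers=80, world_size=8)
print(device_map["language_model.model.layers.0"])   # cuda:0
print(device_map["language_model.model.layers.79"])  # cuda:7
```

With a layer count that is not divisible by the GPU count, the quickstart's nested loop would leave the trailing layers unassigned; the sketch raises an error instead of guessing a placement.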
added_tokens.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "<code_interpreter>": 200023,
3
+ "<commit_after>": 200018,
4
+ "<commit_before>": 200016,
5
+ "<commit_msg>": 200017,
6
+ "<empty_output>": 200015,
7
+ "<filename>": 200006,
8
+ "<fim_middle>": 200002,
9
+ "<fim_pad>": 200004,
10
+ "<fim_prefix>": 200001,
11
+ "<fim_suffix>": 200003,
12
+ "<function_call>": 200022,
13
+ "<gh_stars>": 200007,
14
+ "<speech>[": 200024,
15
+ "<image>[": 200025,
16
+ "<issue_closed>": 200010,
17
+ "<issue_comment>": 200009,
18
+ "<issue_start>": 200008,
19
+ "<jupyter_code>": 200013,
20
+ "<jupyter_output>": 200014,
21
+ "<jupyter_start>": 200011,
22
+ "<jupyter_text>": 200012,
23
+ "<reponame>": 200005,
24
+ "[e~[": 200020,
25
+ "]!d~[": 200021,
26
+ "]!p~[": 200000,
27
+ "]~b]": 200019
28
+ }
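Reviewer note on `added_tokens.json`: two of these IDs are referenced elsewhere in this commit. The README quickstart passes `eos_token_id=200020`, and `config.json` sets `image_token_index` to 200025. A minimal sketch for looking up which token strings those IDs correspond to, using a few entries copied from the file above (the variable names are illustrative only):

```python
import json

# A subset of the entries from added_tokens.json in this commit.
ADDED_TOKENS_JSON = """
{
  "<image>[": 200025,
  "[e~[": 200020,
  "]!p~[": 200000,
  "]~b]": 200019
}
"""

token_to_id = json.loads(ADDED_TOKENS_JSON)
# Invert the mapping so an ID can be resolved back to its token string.
id_to_token = {token_id: token for token, token_id in token_to_id.items()}

print(id_to_token[200020])  # [e~[      (ID used as eos_token_id in the README)
print(id_to_token[200025])  # <image>[  (ID used as image_token_index in config.json)
```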
apple.jpg ADDED
chat_template.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{ '<beginning_of_sentence>system ai_setting=assistant\n' }}{% for item in message['content'] %}{% if item.type == 'image' %}<image>{% elif item.type == 'text' %}{{ item.text }}{% endif %}{% endfor %}{{ '<end_of_sentence>\n' }}{% endif %}{% if message['role'] == 'assistant' %}{{ '<beginning_of_sentence>ai name=assistant\n' }}{% for item in message['content'] %}{% if item.type == 'image' %}<image>{% elif item.type == 'text' %}{{ item.text }}{% endif %}{% endfor %}{{ '<end_of_sentence>\n' }}{% endif %}{% if message['role'] == 'user' %}{{ '<beginning_of_sentence>user name=user\n' }}{% for item in message['content'] %}{% if item.type == 'image' %}<image>{% elif item.type == 'text' %}{{ item.text }}{% endif %}{% endfor %}{{ '<end_of_sentence>\n' }}{% endif %}{% endfor %}{{ '<beginning_of_sentence>ai name=assistant\n' }}"
3
+ }
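Reviewer note on `chat_template.json`: the Jinja template wraps each message in a role header and an `<end_of_sentence>` marker, emits `<image>` for image items, and always ends with an open assistant header. The sketch below is a hand-written Python equivalent of that logic, meant only to make the resulting prompt format easy to inspect. `render_chat` and `ROLE_HEADERS` are hypothetical names; the real pipeline renders the template through the tokenizer's `apply_chat_template`.

```python
ROLE_HEADERS = {
    "system": "<beginning_of_sentence>system ai_setting=assistant\n",
    "assistant": "<beginning_of_sentence>ai name=assistant\n",
    "user": "<beginning_of_sentence>user name=user\n",
}


def render_chat(messages):
    """Hand-written equivalent of the Jinja template above (hypothetical helper)."""
    parts = []
    for message in messages:
        header = ROLE_HEADERS.get(message["role"])
        if header is None:
            continue  # the template emits nothing for any other role
        parts.append(header)
        for item in message["content"]:
            if item["type"] == "image":
                parts.append("<image>")
            elif item["type"] == "text":
                parts.append(item["text"])
        parts.append("<end_of_sentence>\n")
    # The template unconditionally appends an open assistant turn.
    parts.append(ROLE_HEADERS["assistant"])
    return "".join(parts)


messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "placeholder"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = render_chat(messages)
print(prompt)
```

Because the final assistant header is appended outside any condition, the template as written does not consult `add_generation_prompt`.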
config.json ADDED
@@ -0,0 +1,261 @@
1
+ {
2
+ "architectures": [
3
+ "MiniMaxVL01ForConditionalGeneration"
4
+ ],
5
+ "auto_map": {
6
+ "AutoModelForCausalLM": "modeling_minimax_vl_01.MiniMaxVL01ForConditionalGeneration",
7
+ "AutoConfig": "configuration_minimax_vl_01.MiniMaxVL01Config"
8
+ },
9
+ "ignore_index": -100,
10
+ "image_grid_pinpoints": [
11
+ [
12
+ 336,
13
+ 336
14
+ ],
15
+ [
16
+ 336,
17
+ 672
18
+ ],
19
+ [
20
+ 336,
21
+ 1008
22
+ ],
23
+ [
24
+ 336,
25
+ 1344
26
+ ],
27
+ [
28
+ 336,
29
+ 1680
30
+ ],
31
+ [
32
+ 336,
33
+ 2016
34
+ ],
35
+ [
36
+ 672,
37
+ 336
38
+ ],
39
+ [
40
+ 672,
41
+ 672
42
+ ],
43
+ [
44
+ 672,
45
+ 1008
46
+ ],
47
+ [
48
+ 672,
49
+ 1344
50
+ ],
51
+ [
52
+ 672,
53
+ 1680
54
+ ],
55
+ [
56
+ 672,
57
+ 2016
58
+ ],
59
+ [
60
+ 1008,
61
+ 336
62
+ ],
63
+ [
64
+ 1008,
65
+ 672
66
+ ],
67
+ [
68
+ 1008,
69
+ 1008
70
+ ],
71
+ [
72
+ 1008,
73
+ 1344
74
+ ],
75
+ [
76
+ 1008,
77
+ 1680
78
+ ],
79
+ [
80
+ 1008,
81
+ 2016
82
+ ],
83
+ [
84
+ 1344,
85
+ 336
86
+ ],
87
+ [
88
+ 1344,
89
+ 672
90
+ ],
91
+ [
92
+ 1344,
93
+ 1008
94
+ ],
95
+ [
96
+ 1344,
97
+ 1344
98
+ ],
99
+ [
100
+ 1680,
101
+ 336
102
+ ],
103
+ [
104
+ 1680,
105
+ 672
106
+ ],
107
+ [
108
+ 1680,
109
+ 1008
110
+ ],
111
+ [
112
+ 2016,
113
+ 336
114
+ ],
115
+ [
116
+ 2016,
117
+ 672
118
+ ],
119
+ [
120
+ 2016,
121
+ 1008
122
+ ]
123
+ ],
124
+ "image_token_index": 200025,
125
+ "model_type": "minimax_vl_01",
126
+ "projector_hidden_act": "gelu",
127
+ "text_config": {
128
+ "architectures": [
129
+ "MiniMaxText01ForCausalLM"
130
+ ],
131
+ "attn_type_list": [
132
+ 0,
133
+ 0,
134
+ 0,
135
+ 0,
136
+ 0,
137
+ 0,
138
+ 0,
139
+ 1,
140
+ 0,
141
+ 0,
142
+ 0,
143
+ 0,
144
+ 0,
145
+ 0,
146
+ 0,
147
+ 1,
148
+ 0,
149
+ 0,
150
+ 0,
151
+ 0,
152
+ 0,
153
+ 0,
154
+ 0,
155
+ 1,
156
+ 0,
157
+ 0,
158
+ 0,
159
+ 0,
160
+ 0,
161
+ 0,
162
+ 0,
163
+ 1,
164
+ 0,
165
+ 0,
166
+ 0,
167
+ 0,
168
+ 0,
169
+ 0,
170
+ 0,
171
+ 1,
172
+ 0,
173
+ 0,
174
+ 0,
175
+ 0,
176
+ 0,
177
+ 0,
178
+ 0,
179
+ 1,
180
+ 0,
181
+ 0,
182
+ 0,
183
+ 0,
184
+ 0,
185
+ 0,
186
+ 0,
187
+ 1,
188
+ 0,
189
+ 0,
190
+ 0,
191
+ 0,
192
+ 0,
193
+ 0,
194
+ 0,
195
+ 1,
196
+ 0,
197
+ 0,
198
+ 0,
199
+ 0,
200
+ 0,
201
+ 0,
202
+ 0,
203
+ 1,
204
+ 0,
205
+ 0,
206
+ 0,
207
+ 0,
208
+ 0,
209
+ 0,
210
+ 0,
211
+ 1
212
+ ],
213
+ "bos_token_id": null,
214
+ "eos_token_id": null,
215
+ "head_dim": 128,
216
+ "hidden_size": 6144,
217
+ "intermediate_size": 9216,
218
+ "layernorm_full_attention_alpha": 3.5565588200778455,
219
+ "layernorm_full_attention_beta": 1.0,
220
+ "layernorm_linear_attention_alpha": 3.5565588200778455,
221
+ "layernorm_linear_attention_beta": 1.0,
222
+ "layernorm_mlp_alpha": 3.5565588200778455,
223
+ "layernorm_mlp_beta": 1.0,
224
+ "max_position_embeddings": 8192,
225
+ "model_type": "minimax_text_01",
226
+ "num_attention_heads": 64,
227
+ "num_experts_per_tok": 2,
228
+ "num_hidden_layers": 80,
229
+ "num_key_value_heads": 8,
230
+ "num_local_experts": 32,
231
+ "postnorm": true,
232
+ "rms_norm_eps": 1e-05,
233
+ "rope_theta": 10000000,
234
+ "rotary_dim": 64,
235
+ "shared_intermediate_size": [
236
+ 0
237
+ ],
238
+ "shared_moe_mode": "sigmoid",
239
+ "vocab_size": 200064
240
+ },
241
+ "transformers_version": "4.42.3",
242
+ "vision_config": {
243
+ "auto_map": {
244
+ "AutoModel": "modeling_clip.CLIPVisionModel"
245
+ },
246
+ "hidden_act": "gelu",
247
+ "hidden_size": 1024,
248
+ "image_size": 336,
249
+ "intermediate_size": 4096,
250
+ "model_type": "clip_vision_model",
251
+ "num_attention_heads": 16,
252
+ "num_hidden_layers": 24,
253
+ "patch_size": 14,
254
+ "projection_dim": 6144,
255
+ "vocab_size": 32000
256
+ },
257
+ "torch_dtype": "bfloat16",
258
+ "vision_feature_layer": -1,
259
+ "vision_feature_select_strategy": "default"
260
+ }
261
+
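Reviewer note on `config.json`: two of the long arrays follow simple patterns that are easier to see in code. The 28 `image_grid_pinpoints` pairs are all multiples of 336 with each side at most 2016, and the listed pairs happen to match "at most 18 tiles of 336×336"; that tile limit is an inference from the values, not something the commit states. Likewise, `attn_type_list` has 80 entries with a 1 at every eighth position. The config carries both `layernorm_linear_attention_*` and `layernorm_full_attention_*` keys, which suggests the two values distinguish linear from full attention layers, but the mapping is not spelled out in this diff. The constant names below are illustrative.

```python
BASE = 336          # vision_config.image_size
PATCH = 14          # vision_config.patch_size
MAX_SIDE_TILES = 6  # 2016 / 336
MAX_TILES = 18      # assumption inferred from the listed pairs, not stated in the source

# Rebuild image_grid_pinpoints in the same order as config.json lists them.
pinpoints = [
    [BASE * a, BASE * b]
    for a in range(1, MAX_SIDE_TILES + 1)
    for b in range(1, MAX_SIDE_TILES + 1)
    if a * b <= MAX_TILES
]

# Patch positions per 336x336 tile for a ViT with 14x14 patches.
tokens_per_tile = (BASE // PATCH) ** 2

# attn_type_list: 80 entries, with a 1 at every eighth position.
attn_type_list = [1 if (i + 1) % 8 == 0 else 0 for i in range(80)]

print(len(pinpoints), pinpoints[0], pinpoints[-1])  # 28 [336, 336] [2016, 1008]
print(tokens_per_tile)                              # 576
print(sum(attn_type_list))                          # 10
```

Note that the largest entries are 1008×2016 and 2016×1008, so the list itself does not contain a 2016×2016 pair.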
configuration_minimax_text_01.py ADDED
@@ -0,0 +1,152 @@
+ """ MiniMaxText01 model configuration"""
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+
+
+ logger = logging.get_logger(__name__)
+
+
+ class MiniMaxText01Config(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`MiniMaxText01Model`]. It is used to instantiate a
+     MiniMaxText01 model according to the specified arguments, defining the model architecture. Instantiating a configuration
+     with the defaults will yield a similar configuration to that of the MiniMaxText01.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 32000):
+             Vocabulary size of the MiniMaxText01 model. Defines the number of different tokens that can be represented by the
+             `inputs_ids` passed when calling [`MiniMaxText01Model`]
+         hidden_size (`int`, *optional*, defaults to 4096):
+             Dimension of the hidden representations.
+         intermediate_size (`int`, *optional*, defaults to 14336):
+             Dimension of the MLP representations.
+         num_hidden_layers (`int`, *optional*, defaults to 32):
+             Number of hidden layers in the Transformer encoder.
+         num_attention_heads (`int`, *optional*, defaults to 32):
+             Number of attention heads for each attention layer in the Transformer encoder.
+         num_key_value_heads (`int`, *optional*, defaults to 8):
+             This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+             `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+             `num_key_value_heads=1`, the model will use Multi Query Attention (MQA), otherwise GQA is used. When
+             converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+             by meanpooling all the original heads within that group. For more details check out [this
+             paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `8`.
+         hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+             The non-linear activation function (function or string) in the decoder.
+         max_position_embeddings (`int`, *optional*, defaults to `4096*32`):
+             The maximum sequence length that this model might ever be used with. MiniMaxText01's sliding window attention
+             allows sequences of up to 4096*32 tokens.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         rms_norm_eps (`float`, *optional*, defaults to 1e-05):
+             The epsilon used by the rms normalization layers.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether or not the model should return the last key/values attentions (not used by all models). Only
+             relevant if `config.is_decoder=True`.
+         pad_token_id (`int`, *optional*):
+             The id of the padding token.
+         bos_token_id (`int`, *optional*):
+             The id of the "beginning-of-sequence" token.
+         eos_token_id (`int`, *optional*):
+             The id of the "end-of-sequence" token.
+         tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+             Whether the model's input and output word embeddings should be tied.
+         rope_theta (`float`, *optional*, defaults to 1000000.0):
+             The base period of the RoPE embeddings.
+         sliding_window (`int`, *optional*):
+             Sliding window attention window size. If not specified, will default to `4096`.
+         attention_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for the attention probabilities.
+         num_experts_per_tok (`int`, *optional*, defaults to 2):
+             The number of experts to route per-token, can be also interpreted as the `top-k` routing
+             parameter
+         num_local_experts (`int`, *optional*, defaults to 8):
+             Number of experts per Sparse MLP layer.
+         output_router_logits (`bool`, *optional*, defaults to `False`):
+             Whether or not the router logits should be returned by the model. Enabling this will also
+             allow the model to output the auxiliary loss. See [here]() for more details
+         router_aux_loss_coef (`float`, *optional*, defaults to 0.001):
+             The aux loss factor for the total loss.
+         router_jitter_noise (`float`, *optional*, defaults to 0.0):
+             Amount of noise to add to the router.
+
+     ```python
+     >>> from transformers import MiniMaxText01Model, MiniMaxText01Config
+
+     >>> # Initializing a MiniMaxText01 style configuration
+     >>> configuration = MiniMaxText01Config()
+
+     >>> # Initializing a model from the MiniMaxText01 style configuration
+     >>> model = MiniMaxText01Model(configuration)
+
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+
+     model_type = "MiniMaxText01"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         vocab_size=32000,
+         hidden_size=4096,
+         intermediate_size=14336,
+         num_hidden_layers=32,
+         num_attention_heads=32,
+         num_key_value_heads=8,
+         hidden_act="silu",
+         max_position_embeddings=4096 * 32,
+         initializer_range=0.02,
+         rms_norm_eps=1e-5,
+         use_cache=True,
+         pad_token_id=None,
+         bos_token_id=None,
+         eos_token_id=None,
+         tie_word_embeddings=False,
+         rope_theta=1e6,
+         sliding_window=None,
+         attention_dropout=0.0,
+         num_experts_per_tok=2,
+         num_local_experts=8,
+         output_router_logits=False,
+         router_aux_loss_coef=0.001,
+         router_jitter_noise=0.0,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.sliding_window = sliding_window
+
+         # for backward compatibility
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.attention_dropout = attention_dropout
+
+         self.num_experts_per_tok = num_experts_per_tok
+         self.num_local_experts = num_local_experts
+         self.output_router_logits = output_router_logits
+         self.router_aux_loss_coef = router_aux_loss_coef
145
+ self.router_jitter_noise = router_jitter_noise
146
+ super().__init__(
147
+ pad_token_id=pad_token_id,
148
+ bos_token_id=bos_token_id,
149
+ eos_token_id=eos_token_id,
150
+ tie_word_embeddings=tie_word_embeddings,
151
+ **kwargs,
152
+ )
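The backward-compatibility branch in `__init__` above can be illustrated in isolation. The following is a minimal sketch, not part of the commit: `AttentionHeadsSketch` is a hypothetical stand-in for the two head-count fields, and the even split of query heads across key/value heads is an assumption based on how grouped-query attention is usually configured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttentionHeadsSketch:
    """Hypothetical stand-in for the head-count fields of the config above."""

    num_attention_heads: int = 32
    num_key_value_heads: Optional[int] = 8

    def __post_init__(self) -> None:
        # Mirror the fallback in `__init__`: a missing value means one
        # key/value head per attention head.
        if self.num_key_value_heads is None:
            self.num_key_value_heads = self.num_attention_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Assumption: query heads are shared evenly across key/value heads.
        return self.num_attention_heads // self.num_key_value_heads


defaults = AttentionHeadsSketch()
legacy = AttentionHeadsSketch(num_key_value_heads=None)
```

With the signature defaults (32 attention heads, 8 key/value heads), each key/value head would serve four query heads; passing `None` falls back to one key/value head per attention head.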
configuration_minimax_vl_01.py ADDED
@@ -0,0 +1,127 @@
1
+ """MiniMaxVL01 model configuration"""
2
+
3
+ from transformers.configuration_utils import PretrainedConfig
4
+ from transformers.utils import logging
5
+ from transformers.models.auto import CONFIG_MAPPING, AutoConfig
6
+ from .configuration_minimax_text_01 import MiniMaxText01Config
7
+
8
+
9
+ class MiniMaxVL01Config(PretrainedConfig):
10
+ r"""
11
+ This is the configuration class to store the configuration of a [`MiniMaxVL01ForConditionalGeneration`]. It is used to instantiate a
12
+ MiniMaxVL01 model according to the specified arguments, defining the model architecture. Instantiating a configuration
13
+ with the defaults will yield a similar configuration to that of the MiniMaxVL01.
14
+
15
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
16
+ documentation from [`PretrainedConfig`] for more information.
17
+
18
+ Args:
19
+ vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `CLIPVisionConfig`):
20
+ The config object or dictionary of the vision backbone.
21
+ text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `MiniMaxText01Config`):
22
+ The config object or dictionary of the text backbone.
23
+ ignore_index (`int`, *optional*, defaults to -100):
24
+ The ignore index for the loss function.
25
+ image_token_index (`int`, *optional*, defaults to 32000):
26
+ The image token index to encode the image prompt.
27
+ projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
28
+ The activation function used by the multimodal projector.
29
+ vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
30
+ The feature selection strategy used to select the vision feature from the vision backbone.
31
+ Can be one of `"default"` or `"full"`. If `"default"`, the CLS token is removed from the vision features.
32
+ If `"full"`, the full vision features are used.
33
+ vision_feature_layer (`int`, *optional*, defaults to -2):
34
+ The index of the layer to select the vision feature.
35
+ image_grid_pinpoints (`List`, *optional*, defaults to `[[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]`):
36
+ A list of possible resolutions to use for processing high resolution images. Each item in the list should be a tuple or list
37
+ of the form `(height, width)`.
38
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
39
+ Whether the model's input and output word embeddings should be tied.
40
+ image_seq_length (`int`, *optional*, defaults to 576):
41
+ Sequence length of one image embedding.
42
+
43
+ Example:
44
+
45
+ ```python
46
+ >>> from transformers import MiniMaxVL01ForConditionalGeneration, MiniMaxVL01Config, CLIPVisionConfig, MiniMaxText01Config
47
+
48
+ >>> # Initializing a CLIP-vision config
49
+ >>> vision_config = CLIPVisionConfig()
50
+
51
+ >>> # Initializing a MiniMaxText01 config
52
+ >>> text_config = MiniMaxText01Config()
53
+
54
+ >>> # Initializing a MiniMaxVL01 style configuration
55
+ >>> configuration = MiniMaxVL01Config(vision_config, text_config)
56
+
57
+ >>> # Initializing a model from the MiniMaxVL01 style configuration
58
+ >>> model = MiniMaxVL01ForConditionalGeneration(configuration)
59
+
60
+ >>> # Accessing the model configuration
61
+ >>> configuration = model.config
62
+ ```"""
63
+
64
+ model_type = "minimax_vl_01"
65
+
66
+ def __init__(
67
+ self,
68
+ vision_config=None,
69
+ text_config=None,
70
+ ignore_index=-100,
71
+ image_token_index=32000,
72
+ projector_hidden_act="gelu",
73
+ vision_feature_select_strategy="default",
74
+ vision_feature_layer=-2,
75
+ image_grid_pinpoints=None,
76
+ tie_word_embeddings=False,
77
+ image_seq_length=576,
78
+ **kwargs,
79
+ ):
80
+ self.ignore_index = ignore_index
81
+ self.image_token_index = image_token_index
82
+ self.projector_hidden_act = projector_hidden_act
83
+ self.image_seq_length = image_seq_length
84
+
85
+ if vision_feature_select_strategy not in ["default", "full"]:
86
+ raise ValueError(
87
+ "vision_feature_select_strategy should be one of 'default', 'full'. "
88
+ f"Got: {vision_feature_select_strategy}"
89
+ )
90
+
91
+ self.vision_feature_select_strategy = vision_feature_select_strategy
92
+ self.vision_feature_layer = vision_feature_layer
93
+ image_grid_pinpoints = (
94
+ image_grid_pinpoints
95
+ if image_grid_pinpoints is not None
96
+ else [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]
97
+ )
98
+ self.image_grid_pinpoints = image_grid_pinpoints
99
+
100
+ if isinstance(vision_config, dict):
101
+ vision_config["model_type"] = (
102
+ vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
103
+ )
104
+ vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
105
+ elif vision_config is None:
106
+ vision_config = CONFIG_MAPPING["clip_vision_model"](
107
+ intermediate_size=4096,
108
+ hidden_size=1024,
109
+ patch_size=14,
110
+ image_size=336,
111
+ num_hidden_layers=24,
112
+ num_attention_heads=16,
113
+ vocab_size=32000,
114
+ projection_dim=768,
115
+ )
116
+
117
+ self.vision_config = vision_config
118
+
119
+ if isinstance(text_config, dict):
120
+ assert "model_type" in text_config, "text_config model_type is not specified"
121
+ text_config = MiniMaxText01Config(**text_config)
122
+ elif text_config is None:
123
+ text_config = MiniMaxText01Config()
124
+
125
+ self.text_config = text_config
126
+
127
+ super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
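The default `image_seq_length=576` is consistent with the default vision settings above (`image_size=336`, `patch_size=14`), and the `"default"` feature selection strategy is documented as dropping the CLS token. The sketch below works through that arithmetic. It is an illustration only: both helper names are hypothetical, and placing the CLS token at index 0 is an assumption based on the usual CLIP layout rather than something this file states.

```python
def patches_per_image(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def select_vision_features(features: list, strategy: str = "default") -> list:
    """Apply the documented strategy to a list of per-token features."""
    if strategy == "default":
        # Assumption: the CLS token is the first entry.
        return features[1:]
    if strategy == "full":
        return features
    raise ValueError(f"Unexpected strategy: {strategy}")


num_patches = patches_per_image(image_size=336, patch_size=14)
tokens = ["cls"] + [f"patch_{i}" for i in range(num_patches)]
kept_default = select_vision_features(tokens, "default")
kept_full = select_vision_features(tokens, "full")
```

Under these assumptions a 336-pixel image yields 24 x 24 = 576 patch tokens, so `"default"` keeps 576 features and `"full"` keeps 577.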
figures/MiniMaxLogo.png ADDED
figures/TextBench.png ADDED
figures/VisionBench.png ADDED
figures/hailuo.svg ADDED
figures/image.jpg ADDED
figures/minimax.svg ADDED
figures/niah.png ADDED

Git LFS Details

  • SHA256: 73fbd47b590198dad0ea6be7c45c35ce738a2978deb893c842721f0f0cf02eb8
  • Pointer size: 132 Bytes
  • Size of remote file: 1.47 MB
image_processor.py ADDED
@@ -0,0 +1,616 @@
1
+ from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
2
+ from typing import Optional, Union, Tuple, Dict, List, Iterable
3
+ from transformers.image_transforms import to_channel_dimension_format, PaddingMode, pad
4
+ from transformers.image_utils import ChannelDimension, to_numpy_array, make_list_of_images, get_image_size, infer_channel_dimension_format
5
+ from transformers.utils import TensorType
6
+ from PIL import Image
7
+ import numpy as np
8
+ try:
9
+ from torchvision.transforms import InterpolationMode
10
+ BICUBIC = InterpolationMode.BICUBIC
11
+ except ImportError:
12
+ BICUBIC = Image.BICUBIC
13
+
14
+ import torch
15
+ from transformers.utils import (
16
+ TensorType,
17
+ is_torch_device,
18
+ is_torch_dtype,
19
+ requires_backends,
20
+ )
21
+
22
+ from torchvision.transforms import Compose, ToTensor, Normalize, ToPILImage, RandomResizedCrop, Resize
23
+
33
+ import os
34
+ processor_for_vllm = int(os.getenv("PROCESSOR_FOR_VLLM", 0))
35
+
36
+ def select_best_resolution(original_size, possible_resolutions):
37
+ """
38
+ Selects the best resolution from a list of possible resolutions based on the original size.
39
+
40
+ Args:
41
+ original_size (tuple): The original size of the image in the format (width, height).
42
+ possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
43
+
44
+ Returns:
45
+ tuple: The best fit resolution in the format (width, height).
46
+ """
47
+ original_width, original_height = original_size
48
+ best_fit = None
49
+ max_effective_resolution = 0
50
+ min_wasted_resolution = float("inf")
51
+
52
+ for width, height in possible_resolutions:
53
+ # Calculate the downscaled size to keep the aspect ratio
54
+ scale = min(width / original_width, height / original_height)
55
+ downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
56
+
57
+ # Calculate effective and wasted resolutions
58
+ effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
59
+ wasted_resolution = (width * height) - effective_resolution
60
+
61
+ if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution):
62
+ max_effective_resolution = effective_resolution
63
+ min_wasted_resolution = wasted_resolution
64
+ best_fit = (width, height)
65
+
66
+ return best_fit
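To make the selection rule concrete, here is a worked example. The function body is a copy of `select_best_resolution` from this file with indentation restored; the two input sizes are hypothetical and chosen so that every scale factor is exact in floating point. The candidate list is the default `image_grid_pinpoints`.

```python
def select_best_resolution(original_size, possible_resolutions):
    # Same logic as the function in this file, with indentation restored.
    original_width, original_height = original_size
    best_fit = None
    max_effective_resolution = 0
    min_wasted_resolution = float("inf")

    for width, height in possible_resolutions:
        # Scale that fits the image inside the candidate, keeping the aspect ratio
        scale = min(width / original_width, height / original_height)
        downscaled_width = int(original_width * scale)
        downscaled_height = int(original_height * scale)

        # Pixels that carry image content vs. pixels that would be padding
        effective_resolution = min(
            downscaled_width * downscaled_height, original_width * original_height
        )
        wasted_resolution = (width * height) - effective_resolution

        if effective_resolution > max_effective_resolution or (
            effective_resolution == max_effective_resolution
            and wasted_resolution < min_wasted_resolution
        ):
            max_effective_resolution = effective_resolution
            min_wasted_resolution = wasted_resolution
            best_fit = (width, height)

    return best_fit


GRID_PINPOINTS = [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]

wide = select_best_resolution((1344, 672), GRID_PINPOINTS)  # 2:1 landscape input
tall = select_best_resolution((672, 1344), GRID_PINPOINTS)  # 1:2 portrait input
```

For the landscape input, three candidates reach the same effective resolution (672 x 336 pixels of content), and `(672, 336)` wins because it wastes no pixels on padding. The portrait input selects `(336, 672)` for the same reason.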
67
+
68
+ def divide_to_patches(image, patch_size):
69
+ """
70
+ Divides an image into patches of a specified size.
71
+
72
+ Args:
73
+ image (PIL.Image.Image): The input image.
74
+ patch_size (int): The size of each patch.
75
+
76
+ Returns:
77
+ list: A list of PIL.Image.Image objects representing the patches.
78
+ """
79
+ patches = []
80
+ width, height = image.size
81
+ for i in range(0, height, patch_size):
82
+ for j in range(0, width, patch_size):
83
+ box = (j, i, j + patch_size, i + patch_size)
84
+ patch = image.crop(box)
85
+ patches.append(patch)
86
+
87
+ return patches
88
+
89
+ def image_size_to_num_patches(image_size, grid_pinpoints, patch_size):
90
+ if not isinstance(grid_pinpoints, list):
91
+ raise TypeError("grid_pinpoints should be a list of tuples or lists")
92
+
93
+ # Important: if image_size is a tensor or ndarray, convert it to a list first; otherwise the calculation below is wrong
94
+ if not isinstance(image_size, (list, tuple)):
95
+ if not isinstance(image_size, (torch.Tensor, np.ndarray)):
96
+ raise TypeError(f"image_size invalid type {type(image_size)} with value {image_size}")
97
+ image_size = image_size.tolist()
98
+
99
+ best_resolution = select_best_resolution(image_size, grid_pinpoints)
100
+ width, height = best_resolution
101
+ num_patches = 0
102
+ # consider change to ceil(height/patch_size)*ceil(width/patch_size) + 1
103
+ for i in range(0, height, patch_size):
104
+ for j in range(0, width, patch_size):
105
+ num_patches += 1
106
+ # add the base patch
107
+ num_patches += 1
108
+ return num_patches
109
+
110
+ def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
111
+ """
112
+ Calculate the shape of the image patch grid after the preprocessing for images of any resolution.
113
+
114
+ Args:
115
+ image_size (`tuple`):
116
+ The size of the input image in the format (width, height).
117
+ grid_pinpoints (`List`):
118
+ A list containing possible resolutions. Each item in the list should be a tuple or list
119
+ of the form `(width, height)`, the order in which `select_best_resolution` unpacks them.
120
+ patch_size (`int`):
121
+ The size of each image patch.
122
+
123
+ Returns:
124
+ tuple: The shape of the image patch grid in the format (width, height).
125
+ """
126
+ if not isinstance(grid_pinpoints, list):
127
+ raise TypeError("grid_pinpoints should be a list of tuples or lists")
128
+
129
+ # Important: if image_size is a tensor or ndarray, convert it to a list first; otherwise the calculation below is wrong
130
+ if not isinstance(image_size, (list, tuple)):
131
+ if not isinstance(image_size, (torch.Tensor, np.ndarray)):
132
+ raise TypeError(
133
+ f"image_size invalid type: {type(image_size)} not valid, should be either list, tuple, np.ndarray or tensor"
134
+ )
135
+ image_size = image_size.tolist()
136
+
137
+ width, height = select_best_resolution(image_size, grid_pinpoints)
138
+ return width // patch_size, height // patch_size
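The patch count and grid shape computed by the helpers above can be checked by hand. The sketch below is an illustration under stated assumptions: `patch_boxes` is a hypothetical helper that reproduces the crop boxes `divide_to_patches` would generate, and the 672 x 336 resolution with a tile size of 336 is an example input, not a value fixed by this file.

```python
import math


def patch_boxes(width, height, patch_size):
    """Crop boxes (left, upper, right, lower) in the row-major order of `divide_to_patches`."""
    return [
        (left, upper, left + patch_size, upper + patch_size)
        for upper in range(0, height, patch_size)
        for left in range(0, width, patch_size)
    ]


width, height, patch_size = 672, 336, 336

boxes = patch_boxes(width, height, patch_size)
# `image_size_to_num_patches` counts one extra "base" patch on top of the tiles.
num_patches = len(boxes) + 1
# `get_anyres_image_grid_shape` returns (width // patch_size, height // patch_size).
grid_shape = (width // patch_size, height // patch_size)
# Closed form suggested by the comment in `image_size_to_num_patches`.
closed_form = math.ceil(height / patch_size) * math.ceil(width / patch_size) + 1
```

Here the resolution splits into two tiles side by side, so the grid shape is `(2, 1)` and the total is three patches once the base patch is added; the closed form gives the same count as the nested loops.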
139
+
140
+
141
+ # custom transform
142
+ class KeeyRatioResize(object):
143
+ def __init__(self, size):
144
+ self.size = size
145
+
146
+ def __call__(self, image):
147
+ return keepratio_resize(image, self.size)
148
+
149
+ def keepratio_resize(image, size, return_scale=False):
150
+ # Resize the image to keep the ratio
151
+ w, h = image.size
152
+ resized_w, resized_h = size
153
+ if w / h > resized_w / resized_h:
154
+ # the image is relatively wider than the target: fit the width, then pad the top and bottom
155
+ new_h = int(resized_w*h/w)
156
+ resized_image = image.resize((resized_w, new_h), Image.BICUBIC)
157
+
158
+ image = Image.new('RGB', (resized_w, resized_h), (0, 0, 0))
159
+ pad_h = (resized_h - new_h) // 2
160
+ image.paste(resized_image, (0, pad_h))
161
+ scale = resized_w / w
162
+ #image.paste(resized_image, (0, 0))
163
+ else:
164
+ # the image is relatively taller than the target: fit the height, then pad the left and right
165
+ new_w = int(resized_h*w/h)
166
+ resized_image = image.resize((new_w, resized_h), Image.BICUBIC)
167
+ image = Image.new('RGB', (resized_w, resized_h), (0, 0, 0))
168
+ #image.paste(resized_image, (0, 0))
169
+ pad_w = (resized_w - new_w) // 2
170
+ image.paste(resized_image, (pad_w, 0))
171
+ scale = resized_h / h
172
+ if return_scale:
173
+ return image, scale
174
+ return image
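The geometry of `keepratio_resize` can be followed without loading an image. The sketch below reproduces only its size arithmetic; `keepratio_geometry` is a hypothetical helper introduced for illustration, and the input sizes are example values.

```python
def keepratio_geometry(image_size, target_size):
    """Return (resized_size, paste_offset, scale) as computed in `keepratio_resize`."""
    w, h = image_size
    target_w, target_h = target_size
    if w / h > target_w / target_h:
        # Relatively wider than the target: fit the width, centre vertically.
        new_h = int(target_w * h / w)
        return (target_w, new_h), (0, (target_h - new_h) // 2), target_w / w
    # Otherwise fit the height and centre horizontally.
    new_w = int(target_h * w / h)
    return (new_w, target_h), ((target_w - new_w) // 2, 0), target_h / h


landscape = keepratio_geometry((800, 600), (336, 336))
portrait = keepratio_geometry((600, 800), (336, 336))
```

An 800 x 600 image resized into a 336 x 336 canvas becomes 336 x 252 and is pasted 42 pixels from the top, leaving equal black bands above and below. The 600 x 800 case mirrors this with bands on the left and right.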
175
+
176
+ def _convert_image_to_rgb(image):
177
+ return image.convert("RGB")
178
+
179
+ def _transform(img_h, img_w, image_mean=(0.48145466, 0.4578275, 0.40821073), image_std=(0.26862954, 0.26130258, 0.27577711)):
180
+ return Compose([
181
+ # ToPILImage(),
182
+ #RandomResizedCrop((img_h, img_w), scale=(0.5, 1.0), interpolation=BICUBIC),
183
+ #Resize((img_h, img_w), interpolation=BICUBIC),
184
+ _convert_image_to_rgb,
185
+ ToTensor(),
186
+ Normalize(image_mean, image_std),
187
+ ])
188
+
189
+
190
+ def get_hw_multiple_of(image_size, multiple, max_size=None):
191
+ w, h = image_size
192
+ new_w = w if w % multiple == 0 else w + (multiple - w % multiple)
193
+ new_h = h if h % multiple == 0 else h + (multiple - h % multiple)
194
+ if max_size is not None:
195
+ assert isinstance(max_size, (list, tuple)) and len(max_size) == 2
196
+ max_w, max_h = max_size
197
+ assert max_w % multiple == 0 and max_h % multiple == 0
198
+ if new_w > max_w or new_h > max_h:
199
+ # ratio = min(max_w / new_w, max_h / new_h)
200
+ # new_w = int(new_w * ratio)
201
+ # new_h = int(new_h * ratio)
202
+ scaled_w = min(max_w, (new_w * max_h) // new_h)
203
+ new_w, new_h = scaled_w, min((new_h * max_w) // new_w, max_h)  # right-hand side still uses the pre-scaling new_w
204
+
205
+ new_w = new_w if new_w % multiple == 0 else new_w + (multiple - new_w % multiple)
206
+ new_h = new_h if new_h % multiple == 0 else new_h + (multiple - new_h % multiple)
207
+ assert new_w % multiple == 0 and new_h % multiple == 0
208
+ assert new_w <= max_w and new_h <= max_h
209
+ return new_w, new_h
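When `max_size` is not given, `get_hw_multiple_of` only rounds each side up to the next multiple. A minimal sketch of that step, using a hypothetical helper name and example sizes with the default `patch_size` of 14:

```python
def round_up_to_multiple(value, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    remainder = value % multiple
    return value if remainder == 0 else value + (multiple - remainder)


new_w = round_up_to_multiple(500, 14)
new_h = round_up_to_multiple(333, 14)
unchanged = round_up_to_multiple(336, 14)
```

A 500 x 333 image would therefore be resized to 504 x 336, while a side that is already a multiple of 14 is left as it is.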
210
+
211
+ def resize_multiple_of(image, multiple, max_size=None):
212
+ """
213
+ Resize the image so that its width and height are both multiples of `multiple`.
214
+
215
+ Args:
216
+ image (PIL.Image.Image): The input image.
217
+ multiple (int): The value that the resized width and height must each be a multiple of.
218
+
219
+ Returns:
220
+ PIL.Image.Image: The resized image.
221
+ """
222
+ width, height = image.size
223
+ new_width, new_height = get_hw_multiple_of((width, height), multiple, max_size)
224
+ return image.resize((new_width, new_height), Image.BICUBIC)
225
+
226
+
227
+
228
+ class CustomBatchFeature(BatchFeature):
229
+ def convert_to_tensors(self, tensor_type: Optional[Union[str, TensorType]] = None):
230
+ """
231
+ Convert the inner content to tensors.
232
+
233
+ Args:
234
+ tensor_type (`str` or [`~utils.TensorType`], *optional*):
235
+ The type of tensors to use. If `str`, should be one of the values of the enum [`~utils.TensorType`]. If
236
+ `None`, no modification is done.
237
+ """
238
+ if tensor_type is None:
239
+ return self
240
+
241
+ is_tensor, as_tensor = self._get_is_as_tensor_fns(tensor_type)
242
+
243
+ # Do the tensor conversion in batch
244
+ for key, value in self.items():
245
+ if key == "pixel_values":
246
+ for i, image in enumerate(value):
247
+ if not is_tensor(image):
248
+ tensor = as_tensor(image)
249
+ self[key][i] = tensor
250
+ continue
251
+ try:
252
+ if not is_tensor(value):
253
+ tensor = as_tensor(value)
254
+
255
+ self[key] = tensor
256
+ except: # noqa E722
257
+ if key == "overflowing_values":
258
+ raise ValueError("Unable to create tensor returning overflowing values of different lengths. ")
259
+ raise ValueError(
260
+ "Unable to create tensor, you should probably activate padding "
261
+ "with 'padding=True' to have batched tensors with the same length."
262
+ )
263
+
264
+ return self
265
+
266
+ def to(self, *args, **kwargs) -> "BatchFeature":
267
+ """
268
+ Send all values to device by calling `v.to(*args, **kwargs)` (PyTorch only). This should support casting in
269
+ different `dtypes` and sending the `BatchFeature` to a different `device`.
270
+
271
+ Args:
272
+ args (`Tuple`):
273
+ Will be passed to the `to(...)` function of the tensors.
274
+ kwargs (`Dict`, *optional*):
275
+ Will be passed to the `to(...)` function of the tensors.
276
+
277
+ Returns:
278
+ [`BatchFeature`]: The same instance after modification.
279
+ """
280
+ requires_backends(self, ["torch"])
281
+ import torch # noqa
282
+
283
+ new_data = {}
284
+ device = kwargs.get("device")
285
+ # Check if the args are a device or a dtype
286
+ if device is None and len(args) > 0:
287
+ # device should be always the first argument
288
+ arg = args[0]
289
+ if is_torch_dtype(arg):
290
+ # The first argument is a dtype
291
+ pass
292
+ elif isinstance(arg, str) or is_torch_device(arg) or isinstance(arg, int):
293
+ device = arg
294
+ else:
295
+ # it's something else
296
+ raise ValueError(f"Attempting to cast a BatchFeature to type {str(arg)}. This is not supported.")
297
+ # We cast only floating point tensors to avoid issues with tokenizers casting `LongTensor` to `FloatTensor`
298
+ for k, v in self.items():
299
+ if k == "pixel_values":
300
+ new_data[k] = [v[i].to(*args, **kwargs) for i in range(len(v))]
301
+ continue
302
+ # check if v is a floating point
303
+ if torch.is_floating_point(v):
304
+ # cast and send to device
305
+ new_data[k] = v.to(*args, **kwargs)
306
+ elif device is not None:
307
+ new_data[k] = v.to(device=device)
308
+ else:
309
+ new_data[k] = v
310
+ self.data = new_data
311
+ return self
312
+
313
+
314
+ def as_tensor(value):
315
+ if isinstance(value, (list, tuple)) and len(value) > 0:
316
+ if isinstance(value[0], np.ndarray):
317
+ value = np.array(value)
318
+ elif (
319
+ isinstance(value[0], (list, tuple))
320
+ and len(value[0]) > 0
321
+ and isinstance(value[0][0], np.ndarray)
322
+ ):
323
+ value = np.array(value)
324
+ if isinstance(value, np.ndarray):
325
+ return torch.from_numpy(value)
326
+ else:
327
+ return torch.tensor(value)
328
+
329
+ class ImageProcessor(BaseImageProcessor):
330
+ model_input_names = ["pixel_values"]
331
+
332
+ def __init__(
333
+ self,
334
+ size: Optional[Union[int, Tuple[int, int], Dict[str, int]]] = None,
335
+ image_mean: Optional[Union[float, List[float]]] = None,
336
+ image_std: Optional[Union[float, List[float]]] = None,
337
+ process_image_mode: Optional[str] = 'resize',
338
+ patch_size: Optional[int] = 14,
339
+ image_grid_pinpoints: List = None,
340
+ **kwargs,
341
+ ) -> None:
342
+ super().__init__(**kwargs)
343
+ self.size = size # (width, height)
344
+ self.image_mean = image_mean
345
+ self.image_std = image_std
346
+ self.process_image_mode = process_image_mode
347
+ image_grid_pinpoints = (
348
+ image_grid_pinpoints
349
+ if image_grid_pinpoints is not None
350
+ else [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]
351
+ )
352
+ self.image_grid_pinpoints = image_grid_pinpoints
353
+ self.patch_size = patch_size
354
+
355
+ def preprocess(self,
356
+ images,
357
+ return_tensors: Optional[Union[str, TensorType]] = None,
358
+ data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
359
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
360
+ **kwargs,
361
+ ):
362
+ if self.process_image_mode == 'resize':
363
+ return self.resize_preprocess(images, return_tensors, data_format, input_data_format, **kwargs)
364
+ elif self.process_image_mode == 'anyres':
365
+ if processor_for_vllm == 1:
366
+ return self.anyres_for_vllm_preprocess(images, return_tensors, data_format, input_data_format, **kwargs)
367
+ return self.anyres_preprocess(images, return_tensors, data_format, input_data_format, **kwargs)
368
+ elif self.process_image_mode == 'keepratio_resize':
369
+ return self.keepratio_resize_preprocess(images, return_tensors, data_format, input_data_format, **kwargs)
370
+ elif self.process_image_mode == 'dynamic_res':
371
+ return self.dynamic_res_preprocess(images, return_tensors, data_format, input_data_format, **kwargs)
372
+ else:
373
+ raise ValueError(f"Invalid process_image_mode: {self.process_image_mode}")
374
+
375
+ def resize_preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs):
376
+ images = make_list_of_images(images)
377
+ all_images = []
378
+ for image in images:
379
+ resized_image = image.resize(self.size, Image.BICUBIC)
380
+ transform_img = _transform(self.size[1], self.size[0], self.image_mean, self.image_std)(resized_image)
381
+ all_images.append(to_numpy_array(transform_img))
382
+
383
+ images = [
384
+ to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
385
+ for image in all_images
386
+ ]
387
+
388
+ data = {"pixel_values": images}
389
+ return CustomBatchFeature(data=data, tensor_type=return_tensors)
390
+
391
+ def keepratio_resize_preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs):
392
+ images = make_list_of_images(images)
393
+ all_images = []
394
+ for image in images:
395
+ resized_image = keepratio_resize(image, self.size)
396
+ transform_img = _transform(self.size[1], self.size[0], self.image_mean, self.image_std)(resized_image)
397
+ all_images.append(to_numpy_array(transform_img))
398
+
399
+ images = [
400
+ to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
401
+ for image in all_images
402
+ ]
403
+
404
+ data = {"pixel_values": images}
405
+ return CustomBatchFeature(data=data, tensor_type=return_tensors)
406
+
407
+ def dynamic_res_preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs):
408
+ images = make_list_of_images(images)
409
+ all_images = []
410
+ image_sizes = []
411
+ for image in images:
412
+ ori_w, ori_h = image.size
413
+ image_sizes.append([ori_h, ori_w])
414
+ resized_image = resize_multiple_of(image, self.patch_size, max_size=self.size)
415
+ resized_w, resized_h = resized_image.size
416
+ transform_img = _transform(resized_h, resized_w, self.image_mean, self.image_std)(resized_image)
417
+ all_images.append(to_numpy_array(transform_img))
418
+
419
+ images = [
420
+ as_tensor(to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format))
421
+ for image in all_images
422
+ ]
423
+
424
+ # data = {"pixel_values": images, "image_sizes": as_tensor(image_sizes)}
425
+ # return data
426
+ data = {"pixel_values": images, "image_sizes": image_sizes}
427
+ #return BatchFeature(data=data, data_format=data_format, tensor_type=return_tensors)
428
+
429
+ return CustomBatchFeature(data=data, tensor_type=return_tensors)
430
+
431
+ def get_image_patches(
432
+ self,
433
+ data: Image,
434
+ image_grid_pinpoints,
435
+ ):
436
+ if not isinstance(image_grid_pinpoints, list):
437
+ raise TypeError("grid_pinpoints must be a list of possible resolutions.")
438
+
439
+
440
+ best_resolution = select_best_resolution(data.size, image_grid_pinpoints)
441
+
442
+ resized_data, scale = keepratio_resize(data, best_resolution, return_scale=True)
443
+ resized_data = divide_to_patches(resized_data, self.size[0])
444
+ ori_data = data.resize(self.size, Image.BICUBIC)
445
+ data = [ori_data] + resized_data
446
+ return data
447
+
448
+ def pad(
449
+ self,
450
+ image: np.ndarray,
451
+ padding: Union[int, Tuple[int, int], Iterable[Tuple[int, int]]],
452
+ mode: PaddingMode = PaddingMode.CONSTANT,
453
+ constant_values: Union[float, Iterable[float]] = 0.0,
454
+ data_format: Optional[Union[str, ChannelDimension]] = None,
455
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
456
+ ) -> np.ndarray:
457
+ """
458
+ Pads the `image` with the specified `padding` and `mode`. Padding can be in the (`height`, `width`)
459
+ dimension or in the (`num_patches`) dimension. In the second case an iterable of tuples is expected
460
+ as input.
461
+
462
+ Args:
463
+ image (`np.ndarray`):
464
+ The image to pad.
465
+ padding (`int` or `Tuple[int, int]` or `Iterable[Tuple[int, int]]`):
466
+ Padding to apply to the edges of the height, width axes. Can be one of three formats:
467
+ - `((before_height, after_height), (before_width, after_width))` unique pad widths for each axis.
468
+ - `((before, after),)` yields same before and after pad for height and width.
469
+ - `(pad,)` or int is a shortcut for before = after = pad width for all axes.
470
+ mode (`PaddingMode`):
471
+ The padding mode to use. Can be one of:
472
+ - `"constant"`: pads with a constant value.
473
+ - `"reflect"`: pads with the reflection of the vector mirrored on the first and last values of the
474
+ vector along each axis.
475
+ - `"replicate"`: pads with the replication of the last value on the edge of the array along each axis.
476
+ - `"symmetric"`: pads with the reflection of the vector mirrored along the edge of the array.
477
+ constant_values (`float` or `Iterable[float]`, *optional*):
478
+ The value to use for the padding if `mode` is `"constant"`.
479
+ data_format (`str` or `ChannelDimension`, *optional*):
480
+ The channel dimension format for the output image. Can be one of:
481
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
482
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
483
+ If unset, will use same as the input image.
484
+ input_data_format (`str` or `ChannelDimension`, *optional*):
485
+ The channel dimension format for the input image. Can be one of:
486
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
487
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
488
+ If unset, will use the inferred format of the input image.
489
+
490
+ Returns:
491
+ `np.ndarray`: The padded image.
492
+
493
+ """
494
+
495
+ # call the general `pad` if padding on `height/width`, otherwise it's the `num_patches` dim
496
+ if isinstance(padding, int) or len(padding) != 4:
497
+ return pad(image, padding, mode, constant_values, data_format, input_data_format)
498
+
499
+ if input_data_format is None:
500
+ input_data_format = infer_channel_dimension_format(image)
501
+ if mode == PaddingMode.CONSTANT:
502
+ image = np.pad(image, padding, mode="constant", constant_values=constant_values)
503
+ elif mode == PaddingMode.REFLECT:
504
+ image = np.pad(image, padding, mode="reflect")
505
+ elif mode == PaddingMode.REPLICATE:
506
+ image = np.pad(image, padding, mode="edge")
507
+ elif mode == PaddingMode.SYMMETRIC:
508
+ image = np.pad(image, padding, mode="symmetric")
509
+ else:
510
+ raise ValueError(f"Invalid padding mode: {mode}")
511
+ image = (
512
+ to_channel_dimension_format(image, data_format, input_data_format) if data_format is not None else image
513
+ )
514
+ return image
515
+
516
+ def _pad_for_batching(
517
+ self,
518
+ pixel_values: List[np.ndarray],
519
+ data_format: Optional[Union[str, ChannelDimension]] = None,
520
+ input_data_format: Optional[Union[str, ChannelDimension]] = None,
521
+ ):
522
+ """
523
+ Pads images on the `num_of_patches` dimension with zeros to form a batch of same number of patches.
524
+
525
+ Args:
526
+ pixel_values (`List[np.ndarray]`):
527
+ The pixel values of each image, of shape (`batch_size`, `num_patches`, `image_in_3D`)
528
+ data_format (`str` or `ChannelDimension`, *optional*):
529
+ The channel dimension format for the output image. Can be one of:
530
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
531
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
532
+ If unset, will use same as the input image.
533
+ input_data_format (`str` or `ChannelDimension`, *optional*):
534
+ The channel dimension format for the input image. Can be one of:
535
+ - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
536
+ - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
537
+ If unset, will use the inferred format of the input image.
538
+
539
+ Returns:
540
+ List[`np.ndarray`]: The padded images.
541
+ """
542
+ max_patch = max(len(x) for x in pixel_values)
543
+ pixel_values = [
544
+ self.pad(
545
+ image,
546
+ padding=((0, max_patch - image.shape[0]), (0, 0), (0, 0), (0, 0)),
547
+ data_format=data_format,
548
+ input_data_format=input_data_format,
549
+ )
550
+ for image in pixel_values
551
+ ]
552
+
553
+ return pixel_values
554
+
+     def anyres_for_vllm_preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, input_data_format: Optional[Union[str, ChannelDimension]] = None, do_pad: Optional[bool] = None, **kwargs):
+
+         images = make_list_of_images(images)
+         new_images = []
+         image_sizes = []
+
+         for image in images:
+             ori_w, ori_h = image.size
+             image_sizes.append([ori_h, ori_w])
+             image_patches = self.get_image_patches(
+                 image,
+                 self.image_grid_pinpoints
+             )
+             all_images = []
+             for patch in image_patches:
+                 transform_img = _transform(self.size[0], self.size[1], self.image_mean, self.image_std)(patch)
+                 img_array = to_numpy_array(transform_img)
+                 img_array = to_channel_dimension_format(img_array, data_format, input_channel_dim=input_data_format)
+                 all_images.append(img_array)
+                 #new_images.append(img_array)
+             pixel_values = np.array(all_images)
+             new_images.append(pixel_values)
+
+
+         new_images = self._pad_for_batching(new_images)
+
+         data = {"pixel_values": new_images, "image_sizes": image_sizes}
+         return BatchFeature(data=data, tensor_type=return_tensors)
+
+
+     def anyres_preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, input_data_format: Optional[Union[str, ChannelDimension]] = None, do_pad: Optional[bool] = None, **kwargs):
+
+         images = make_list_of_images(images)
+         new_images = []
+         image_sizes = []
+
+         for image in images:
+             ori_w, ori_h = image.size
+             image_sizes.append([ori_h, ori_w])
+             image_patches = self.get_image_patches(
+                 image,
+                 self.image_grid_pinpoints
+             )
+             #all_images = []
+             for patch in image_patches:
+                 transform_img = _transform(self.size[0], self.size[1], self.image_mean, self.image_std)(patch)
+                 img_array = to_numpy_array(transform_img)
+                 img_array = to_channel_dimension_format(img_array, data_format, input_channel_dim=input_data_format)
+                 #all_images.append(img_array)
+                 new_images.append(img_array)
+             #pixel_values = np.array(all_images)
+             #new_images.append(pixel_values)
+
+         # if do_pad:
+         #     new_images = self._pad_for_batching(new_images)
+
+         data = {"pixel_values": new_images, "image_sizes": image_sizes}
+         return CustomBatchFeature(data=data, tensor_type=return_tensors)
+
+
+
+
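The zero-padding that `_pad_for_batching` applies along the patch dimension can be sketched in isolation as follows. This is a minimal sketch that calls `np.pad` directly rather than going through the processor's `pad` method; the helper name `pad_patch_dim` and the toy shapes are assumptions made for illustration only.

```python
import numpy as np


def pad_patch_dim(pixel_values):
    """Zero-pad each (num_patches, C, H, W) array to the largest num_patches in the list."""
    max_patch = max(len(x) for x in pixel_values)
    return [
        np.pad(
            x,
            # Pad only at the end of axis 0 (patches); leave C, H, W untouched.
            ((0, max_patch - x.shape[0]), (0, 0), (0, 0), (0, 0)),
            mode="constant",
            constant_values=0,
        )
        for x in pixel_values
    ]


# Toy input (hypothetical sizes): two images with 3 and 5 patches of shape (3, 4, 4).
a = np.ones((3, 3, 4, 4), dtype=np.float32)
b = np.ones((5, 3, 4, 4), dtype=np.float32)
padded = pad_patch_dim([a, b])
print([p.shape for p in padded])
```

After padding, every array in the list shares the same leading dimension, so the list can be stacked into a single batch tensor.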
main.py ADDED
@@ -0,0 +1,124 @@
+ import transformers
+ from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig, QuantoConfig, GenerationConfig
+ import torch
+ import safetensors.torch  # explicit submodule import; query_safetensors() calls safetensors.torch.load_file
+ import argparse
+ import os
+ import json
+ from PIL import Image
+
+ """
+     usage:
+         export SAFETENSORS_FAST_GPU=1
+         python main.py --quant_type int8 --world_size 8 --model_id <model_path> --image_path <image_path>
+ """
+
+ def generate_quanto_config(hf_config: AutoConfig, quant_type: str):
+     QUANT_TYPE_MAP = {
+         "default": None,
+         "int8": QuantoConfig(
+             weights="int8",
+             modules_to_not_convert=[
+                 "vision_tower",
+                 "image_newline",
+                 "multi_modal_projector",
+                 "lm_head",
+                 "embed_tokens",
+             ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.text_config.num_hidden_layers)]
+             + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.text_config.num_hidden_layers)]
+         ),
+     }
+     return QUANT_TYPE_MAP[quant_type]
+
+ def parse_args():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--quant_type", type=str, default="default", choices=["default", "int8"])
+     parser.add_argument("--model_id", type=str, required=True)
+     parser.add_argument("--world_size", type=int, required=True)
+     parser.add_argument("--image_path", type=str, required=True)
+     return parser.parse_args()
+
+ def check_params(args, hf_config: AutoConfig):
+     if args.quant_type == "int8":
+         assert args.world_size >= 8, "int8 weight-only quantization requires at least 8 GPUs"
+
+     assert hf_config.text_config.num_hidden_layers % args.world_size == 0, f"num_hidden_layers({hf_config.text_config.num_hidden_layers}) must be divisible by world_size({args.world_size})"
+
+ @torch.no_grad()
+ def main():
+     args = parse_args()
+     print("\n=============== Argument ===============")
+     for key in vars(args):
+         print(f"{key}: {vars(args)[key]}")
+     print("========================================")
+
+     model_id = args.model_id
+
+     hf_config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
+     processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+     quantization_config = generate_quanto_config(hf_config, args.quant_type)
+
+     check_params(args, hf_config)
+
+     model_safetensors_index_path = os.path.join(model_id, "model.safetensors.index.json")
+     with open(model_safetensors_index_path, "r") as f:
+         model_safetensors_index = json.load(f)
+     weight_map = model_safetensors_index['weight_map']
+     vision_map = {}
+     for key, value in weight_map.items():
+         if 'vision_tower' in key or 'image_newline' in key or 'multi_modal_projector' in key:
+             new_key = key.replace('.weight','').replace('.bias','')
+             if new_key not in vision_map:
+                 vision_map[new_key] = value
+     device_map = {
+         'language_model.model.embed_tokens': 'cuda:0',
+         'language_model.model.norm': f'cuda:{args.world_size - 1}',
+         'language_model.lm_head': f'cuda:{args.world_size - 1}'
+     }
+     for key, value in vision_map.items():
+         device_map[key] = 'cuda:0'
+     device_map['vision_tower.vision_model.post_layernorm'] = 'cuda:0'
+     layers_per_device = hf_config.text_config.num_hidden_layers // args.world_size
+     for i in range(args.world_size):
+         for j in range(layers_per_device):
+             device_map[f'language_model.model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'
+
+     messages = [
+         {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by Minimax based on MiniMax-VL-01 model."}]},
+         {"role": "user", "content": [{"type": "image", "image": "placeholder"},{"type": "text", "text": "Describe this image."}]},
+     ]
+     prompt = processor.tokenizer.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
+     print(f"prompt: \n{prompt}")
+     raw_image = Image.open(args.image_path)
+     model_inputs = processor(images=[raw_image], text=prompt, return_tensors='pt').to('cuda').to(torch.bfloat16)
+
+     quantized_model = AutoModelForCausalLM.from_pretrained(
+         model_id,
+         torch_dtype="bfloat16",
+         device_map=device_map,
+         quantization_config=quantization_config,
+         trust_remote_code=True,
+         offload_buffers=True,
+     )
+     generation_config = GenerationConfig(
+         max_new_tokens=100,
+         eos_token_id=200020,
+         use_cache=True,
+     )
+     generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
+     print(f"generated_ids: {generated_ids}")
+     generated_ids = [
+         output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+     ]
+     response = processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+     print(response)
+     # The image depicts a single, whole apple with a rich, red color. The apple appears to be fresh, with a smooth, glossy skin that reflects light, indicating its juiciness. The surface of the apple is dotted with small, light-colored
+
+ def query_safetensors(path):
+     safetensor = safetensors.torch.load_file(path)
+     for key in safetensor.keys():
+         print(key, safetensor[key].shape)
+ if __name__ == "__main__":
+     main()
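The layer-to-GPU assignment that `main()` builds can be looked at on its own. The following is a minimal sketch of that partitioning under stated assumptions: the helper name `build_layer_device_map` is hypothetical, the layer and device counts are toy values rather than the model's real configuration, and the vision-tower entries that `main()` also places on `cuda:0` are omitted. Device names are plain strings, so nothing here needs a GPU.

```python
def build_layer_device_map(num_hidden_layers: int, world_size: int) -> dict:
    """Assign contiguous blocks of decoder layers to devices, as main() does."""
    if num_hidden_layers % world_size != 0:
        raise ValueError(
            f"num_hidden_layers({num_hidden_layers}) must be divisible by world_size({world_size})"
        )
    layers_per_device = num_hidden_layers // world_size
    # Embeddings go on the first device; final norm and LM head on the last.
    device_map = {
        "language_model.model.embed_tokens": "cuda:0",
        "language_model.model.norm": f"cuda:{world_size - 1}",
        "language_model.lm_head": f"cuda:{world_size - 1}",
    }
    for i in range(world_size):
        for j in range(layers_per_device):
            layer_idx = i * layers_per_device + j
            device_map[f"language_model.model.layers.{layer_idx}"] = f"cuda:{i}"
    return device_map


# Toy values (assumed for illustration): 16 layers spread over 4 devices.
dm = build_layer_device_map(num_hidden_layers=16, world_size=4)
for name, device in dm.items():
    print(name, "->", device)
```

Because each device receives a contiguous block of layers, activations cross a device boundary only once per block, which is why the script asserts that the layer count divides evenly by `world_size`.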
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00000-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0065c8f5ab64c9f87f7cd6e673142c55890dfa943222bec51233ef606ec416dd
+ size 4916773016
model-00001-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ab8a53c8d1d0c760469af949114ba57ef6c1850fd2699cd37eeeb72a2ad3f8cb
+ size 2272091616
model-00002-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ddc886f947bacf60eda905d906325a1002005d23f6689560480bb0f946bd9cb7
+ size 2103016504
model-00003-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:03cc4611b191b9c112131145d84d9152f3bf8baca3b5f29ba2e899aa53889bbe
+ size 2116402768
model-00004-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:49253fa9d4675599a7443cddd5430f89a624ac9f06ce2c159fd74d8a71eca884
+ size 2254811112
model-00005-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ee9a12e22b07386ff6e31ee0796db67cb73133004595da86bcc516a151ef8c8c
+ size 2103016528
model-00006-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9a661ac4289717b68bbf16373de12c2f36a29e78253df4214fae1044a0e8fe95
+ size 2116402792
model-00007-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d9a92374ed8b5dc251bc75ce342b0e83eb18cfa246c1b4a4b6967825836fca33
+ size 2254811144
model-00008-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dfc4852d13aa5cec59b418e88ff89b7b0075c18245d15a074cc94a5ed7aca719
+ size 2151680728
model-00009-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c0a5af9d01837d143df71e1fe9a33b837b28032fda68479e7a436e96b9ec837f
+ size 2264927088
model-00010-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8576c0f3eeabf6c38918ada90fcdb3684a67e419bd86634448eda8354a1b91d3
+ size 2151680744
model-00011-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4e1e3422c3b97af229f07ca345dc3aa2592224ace624e465f559137cc958aa50
+ size 2264927080
model-00012-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:610667ce07672247bb4e352a6ff936fea8dc247c28915c40b85b8e2af19ffec9
+ size 2151680736
model-00013-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5d8bee23d6512e88088ecff78ad357fb09934cddd51fefec4f3d5513987db9fe
+ size 2264927088
model-00014-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a3c324b8e3d0dcda8fcf0614254848e3ae11277d1f4ee0dcdaa92fc4833ca7c2
+ size 2151680728
model-00015-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a2e300183a0267021275fab80c84bd16e3e0a762cbaf5e495a9bf2c6a195df4d
+ size 2264927096
model-00016-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:91c6de20aade4b94e30964acb21f39b9426aafa6add5dfc08b2fbe0901532d6d
+ size 2151680736
model-00017-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6b717ecbb07cf950011f3af35725229f9b5447a8fbbcd9dd0568897441320640
+ size 2264927080
model-00018-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc00a918371812b3d36cc1e9ee52bf74e958d91ad142c26b6bb05a811d89566a
+ size 2151680744
model-00019-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4652c9182fad501ba64a17dddc8241949c4cf29e8c6ae9dbcd29858cbb7d559e
+ size 2264927080
model-00020-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef516243538d99ec52253e7941a8be10b74e80c603adaeafc787056963598c9d
+ size 2151680728
model-00021-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:52aa8727d12d4bedcea47ff4cc589b370853d1c297c315eee958b9f10995f550
+ size 2264927096
model-00022-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e75881dbed288aa4837d2190c357f7a4d4f472eda6794d84eea9ec2354c7dfbf
+ size 2151680728
model-00023-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:84150d127a938f8a9a8c5d961eb9ff13a1223e31acff39ff41caebc9d380137b
+ size 2264927088
model-00024-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:47ebd8c212e66da57aca48ba1b610c7ac4626661b57e93ee10a8f135f04a0232
+ size 2151680736
model-00025-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:58be1e67f2e11b69e1b43a73927e6f3ed234198d27a3237f6ee74507845e667c
+ size 2264927080
model-00026-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a8da785dcce3e79ab2b3e30362fcfdde52e7275996e132bf12020486888a46e
+ size 2151680744
model-00027-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f6f13cd963b5b337ae4f7742e925ad47071423bcb96d1838688ae630ee4504ae
+ size 2264927088
model-00028-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d3828583b89238a35f225d21ff996cbc20e29f4ed20f36533933f52b080caa12
+ size 2151680728
model-00029-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8234bcb63517f7844a01fa11de3c3fab625337d071ec5df837864578bb391b0d
+ size 2264927096
model-00030-of-00414.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7b44dd72ff07dd79e381ecacb3d24c37b2c8dc1261fc731ab3f74e7c4cc32226
+ size 2151680728