MiniMax-AI committed on
Commit · cfde609
1 Parent(s): 305d273
Initial Commit
This view is limited to 50 files because it contains too many changes.
- .gitattributes +1 -0
- LICENSE +42 -0
- README.md +158 -3
- added_tokens.json +28 -0
- apple.jpg +0 -0
- chat_template.json +3 -0
- config.json +261 -0
- configuration_minimax_text_01.py +152 -0
- configuration_minimax_vl_01.py +127 -0
- figures/MiniMaxLogo.png +0 -0
- figures/TextBench.png +0 -0
- figures/VisionBench.png +0 -0
- figures/hailuo.svg +1 -0
- figures/image.jpg +0 -0
- figures/minimax.svg +1 -0
- figures/niah.png +3 -0
- image_processor.py +616 -0
- main.py +124 -0
- merges.txt +0 -0
- model-00000-of-00414.safetensors +3 -0
- model-00001-of-00414.safetensors +3 -0
- model-00002-of-00414.safetensors +3 -0
- model-00003-of-00414.safetensors +3 -0
- model-00004-of-00414.safetensors +3 -0
- model-00005-of-00414.safetensors +3 -0
- model-00006-of-00414.safetensors +3 -0
- model-00007-of-00414.safetensors +3 -0
- model-00008-of-00414.safetensors +3 -0
- model-00009-of-00414.safetensors +3 -0
- model-00010-of-00414.safetensors +3 -0
- model-00011-of-00414.safetensors +3 -0
- model-00012-of-00414.safetensors +3 -0
- model-00013-of-00414.safetensors +3 -0
- model-00014-of-00414.safetensors +3 -0
- model-00015-of-00414.safetensors +3 -0
- model-00016-of-00414.safetensors +3 -0
- model-00017-of-00414.safetensors +3 -0
- model-00018-of-00414.safetensors +3 -0
- model-00019-of-00414.safetensors +3 -0
- model-00020-of-00414.safetensors +3 -0
- model-00021-of-00414.safetensors +3 -0
- model-00022-of-00414.safetensors +3 -0
- model-00023-of-00414.safetensors +3 -0
- model-00024-of-00414.safetensors +3 -0
- model-00025-of-00414.safetensors +3 -0
- model-00026-of-00414.safetensors +3 -0
- model-00027-of-00414.safetensors +3 -0
- model-00028-of-00414.safetensors +3 -0
- model-00029-of-00414.safetensors +3 -0
- model-00030-of-00414.safetensors +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+figures/niah.png filter=lfs diff=lfs merge=lfs -text
LICENSE
ADDED
@@ -0,0 +1,42 @@
MINIMAX MODEL LICENSE AGREEMENT

1. Definitions
"Agreement" means the terms and conditions for use, reproduction, distribution and modification of the MiniMax Model Materials set forth herein.
"Licensee" or "you" means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity's behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering into this Agreement on their behalf.
"MiniMax Model" means the foundational large language models and software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing distributed by MiniMax at https://huggingface.co/MiniMaxAI/MiniMaxText01, https://huggingface.co/MiniMaxAI/MiniMaxVL01, and https://github.com/MiniMax-AI/MiniMax01. In this Agreement, "MiniMax Model" includes MiniMaxText01 and MiniMaxVL01.
"MiniMax Model Materials" means, collectively, MiniMax's proprietary MiniMax Model and Documentation (and any portion thereof) made available under this Agreement.
"MiniMax" or "we" means MiniMax AI.

2. License Rights and Redistribution
a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under MiniMax's intellectual property or other rights owned by MiniMax embodied in the MiniMax Model Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the MiniMax Model Materials.
b. Redistribution and Use.
i. If you distribute or make available the MiniMax Model Materials (or any derivative works thereof), or a product or service that uses any of them, including another AI model, you shall (A) provide a copy of this Agreement with any such MiniMax Model Materials; and (B) prominently display "Built with MiniMax AI" on a related website, user interface, blogpost, about page, or product documentation. If you use the MiniMax Model Materials to create, train, fine-tune, or otherwise improve an AI model, which is distributed or made available, you shall also include "MiniMax" at the beginning of any such AI model name.
ii. You must retain in all copies of the MiniMax Model Materials that you distribute the following attribution notice within a "Notice" text file distributed as a part of such copies: "MiniMax AI model is licensed under the MiniMax License, Copyright © MiniMax. All Rights Reserved."
iii. Your use of the MiniMax Model Materials must comply with applicable laws and regulations (including trade compliance laws and regulations) and adhere to the Prohibited Uses Policy for the MiniMax Model Materials, which is hereby incorporated by reference into this Agreement.
iv. You will not use the MiniMax Model Materials or any output or results of the MiniMax Model Materials to improve any other large language model.

3. Additional Commercial Terms. If, on the MiniMax Model Materials release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 100 million monthly active users in the preceding calendar month, you must request a license from MiniMax, which MiniMax may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until MiniMax otherwise expressly grants you such rights.

4. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE MINIMAX MODEL MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, AND MINIMAX DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE MINIMAX MODEL MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE MINIMAX MODEL MATERIALS AND ANY OUTPUT AND RESULTS.

5. Limitation of Liability. IN NO EVENT WILL MINIMAX OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF MINIMAX OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.

6. Intellectual Property.
a. No trademark licenses are granted under this Agreement, and in connection with the MiniMax Model Materials, neither MiniMax nor Licensee may use any name or mark owned by or associated with the other or any of its affiliates, except as required for reasonable and customary use in describing and redistributing the MiniMax Materials or as set forth in this Section 6(a). MiniMax hereby grants you a license to use "MiniMaxText01" or "MiniMaxVL01" (the "Mark") solely as required to comply with the last sentence of Section 2.b.i. All goodwill arising out of your use of the Mark will inure to the benefit of MiniMax.
b. Subject to MiniMax's ownership of MiniMax Model Materials and derivatives made by or for MiniMax, with respect to any derivative works and modifications of the MiniMax Model Materials that are made by you, as between you and MiniMax, you are and will be the owner of such derivative works and modifications.
c. If you institute litigation or other proceedings against MiniMax or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the MiniMax Model Materials or outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless MiniMax from and against any claim by any third party arising out of or related to your use or distribution of the MiniMax Model Materials.
7. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the MiniMax Model Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. MiniMax may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the MiniMax Model Materials. Sections 3, 4 and 7 shall survive the termination of this Agreement.

8. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of Singapore without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. Any dispute arising out of or in connection with this Agreement, including any question regarding its existence, validity or termination, shall be referred to and finally resolved by arbitration administered by the Singapore International Arbitration Centre ("SIAC") in accordance with the Arbitration Rules of the Singapore International Arbitration Centre ("SIAC Rules") for the time being in force, which rules are deemed to be incorporated by reference in this clause.

You agree you will not use, or allow others to use, MiniMaxText01 or MiniMaxVL01 to:
1. Violate any applicable federal, state, local, or international law or regulation, or infringe upon the lawful rights or interests of any third party.
2. Assist with, engage in or in any way associate with any military purpose.
3. Exploit, harm, or attempt to exploit or harm minors in any way.
4. Generate or disseminate false or misleading information with the intent to harm others.
5. Generate or disseminate content prohibited by applicable laws or regulations.
6. Generate or disseminate personally identifiable information without proper authorization or for unreasonable or unlawful purposes.
7. Defame, disparage, harass, or cause harm to any individual or entity.
8. Carry out fully automated decision-making that adversely affects an individual's legal rights or otherwise creates or modifies a binding, enforceable obligation.
9. Promote discrimination, hate speech, or harmful behavior towards individuals or groups based on race or ethnic origin, religion, disability, age, nationality and national origin, veteran status, sexual orientation, gender or gender identity, caste, immigration status, or any other legally protected characteristics or categories.
README.md
CHANGED
@@ -1,3 +1,158 @@
<div align="center">
<img src="figures/MiniMaxLogo.png" width="60%" alt="MiniMax-Text-01" />
</div>
<hr>

<div align="center" style="line-height: 1;">
<a href="https://www.minimaxi.com/en" target="_blank" style="margin: 2px;">
<img alt="Homepage" src="https://img.shields.io/badge/_Homepage-MiniMax-FF4040?style=flat-square&labelColor=2C3E50&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB2aWV3Qm94PSIwIDAgNDkwLjE2IDQxMS43Ij48ZGVmcz48c3R5bGU+LmNscy0xe2ZpbGw6I2ZmZjt9PC9zdHlsZT48L2RlZnM+PHBhdGggY2xhc3M9ImNscy0xIiBkPSJNMjMzLjQ1LDQwLjgxYTE3LjU1LDE3LjU1LDAsMSwwLTM1LjEsMFYzMzEuNTZhNDAuODIsNDAuODIsMCwwLDEtODEuNjMsMFYxNDVhMTcuNTUsMTcuNTUsMCwxLDAtMzUuMDksMHY3OS4wNmE0MC44Miw0MC44MiwwLDAsMS04MS42MywwVjE5NS40MmExMS42MywxMS42MywwLDAsMSwyMy4yNiwwdjI4LjY2YTE3LjU1LDE3LjU1LDAsMCwwLDM1LjEsMFYxNDVBNDAuODIsNDAuODIsMCwwLDEsMTQwLDE0NVYzMzEuNTZhMTcuNTUsMTcuNTUsMCwwLDAsMzUuMSwwVjIxNy41aDBWNDAuODFhNDAuODEsNDAuODEsMCwxLDEsODEuNjIsMFYyODEuNTZhMTEuNjMsMTEuNjMsMCwxLDEtMjMuMjYsMFptMjE1LjksNjMuNEE0MC44Niw0MC44NiwwLDAsMCw0MDguNTMsMTQ1VjMwMC44NWExNy41NSwxNy41NSwwLDAsMS0zNS4wOSwwdi0yNjBhNDAuODIsNDAuODIsMCwwLDAtODEuNjMsMFYzNzAuODlhMTcuNTUsMTcuNTUsMCwwLDEtMzUuMSwwVjMzMGExMS42MywxMS42MywwLDEsMC0yMy4yNiwwdjQwLjg2YTQwLjgxLDQwLjgxLDAsMCwwLDgxLjYyLDBWNDAuODFhMTcuNTUsMTcuNTUsMCwwLDEsMzUuMSwwdjI2MGE0MC44Miw0MC44MiwwLDAsMCw4MS42MywwVjE0NWExNy41NSwxNy41NSwwLDEsMSwzNS4xLDBWMjgxLjU2YTExLjYzLDExLjYzLDAsMCwwLDIzLjI2LDBWMTQ1QTQwLjg1LDQwLjg1LDAsMCwwLDQ0OS4zNSwxMDQuMjFaIi8+PC9zdmc+&logoWidth=20" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/MiniMaxAI" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/🤗_Hugging_Face-MinMax-FF4040?style=flat-square&labelColor=2C3E50" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div align="center" style="line-height: 1;">
<a href="https://www.hailuo.ai/" target="_blank" style="margin: 2px;">
<img alt="Chat" src="https://img.shields.io/badge/Chat-_Hailuo AI-FF4040?style=flat-square&labelColor=2C3E50&logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB2aWV3Qm94PSIwIDAgMzc1LjE0IDM3NS4xNCI+PGRlZnM+PHN0eWxlPi5jbHMtMXtmaWxsOnVybCgjdW5uYW1lZC1ncmFkaWVudCk7fTwvc3R5bGU+PGxpbmVhckdyYWRpZW50IGlkPSJ1bm5hbWVkLWdyYWRpZW50IiB4MT0iOC40MiIgeTE9IjEzLjgxIiB4Mj0iNDI5LjY1IiB5Mj0iNDIyLjM3IiBncmFkaWVudFVuaXRzPSJ1c2VyU3BhY2VPblVzZSI+PHN0b3Agb2Zmc2V0PSIwLjA5IiBzdG9wLWNvbG9yPSIjZmZhYjBjIi8+PHN0b3Agb2Zmc2V0PSIwLjMxIiBzdG9wLWNvbG9yPSIjZmY1NTM4Ii8+PHN0b3Agb2Zmc2V0PSIwLjQ2IiBzdG9wLWNvbG9yPSIjZTk0MDVkIi8+PHN0b3Agb2Zmc2V0PSIwLjc1IiBzdG9wLWNvbG9yPSIjZDI2NmRhIi8+PHN0b3Agb2Zmc2V0PSIwLjg5IiBzdG9wLWNvbG9yPSIjZDU4NGVmIi8+PC9saW5lYXJHcmFkaWVudD48L2RlZnM+PHBhdGggY2xhc3M9ImNscy0xIiBkPSJNMzc1LjE0LDE4Ny41N0MzNzUuMTQsODQsMjkwLjc0LS4yNiwxODcuMDksMCw4NC4yNi4yNi4yNiw4NC4yNSwwLDE4Ny4wOWMtLjI2LDEwMy42NSw4NCwxODgsMTg3LjU3LDE4OEgzMTAuODJBNjQuMjEsNjQuMjEsMCwwLDAsMzc1LDMxMC45M1YxOTMuODJoMEMzNzUuMDksMTkxLjc5LDM3NS4xNCwxODkuNjcsMzc1LjE0LDE4Ny41N1ptLTI4NCwxMDQuMTdjLTI5Ljg2LTI1LjQ5LTQ4LjI2LTY2LjI3LTQ3LjQtMTA3Ljg1cS4wOS00LjM4LjQ2LTguNzNWMTc1YzQuMzItNDkuNiwzNi4zNy05NS44OCw4MS4yOS0xMTcuMzZTMjI2LjUyLDQwLjIxLDI2Ny44NSw2OHM2Ni4zMiw3OC4yMSw2My40LDEyNy45MmExNzgsMTc4LDAsMCwxLTUuMTQsMzIuMjVjLTEsNC4yLTIuMyw4LjU3LTUuMjgsMTEuNzJzLTguMiw0LjYtMTEuNzMsMi4wOWMtMy4zNy0yLjQxLTMuODctNy4xMi00LjE2LTExLjI1LTIuMzMtMzMuMzctMTEuMjQtNjcuNzYtMzMuNzktOTIuNDdhMTAzLjY3LDEwMy42NywwLDAsMC02Ni4zOC0zMi44NEExMDcuMTksMTA3LjE5LDAsMCwwLDEzMy4yMiwxMjVDMTE2LDEzNy4yNywxMDIuNTUsMTU0Ljg4LDk2LDE3NXMtNS44Niw0Mi42MSwyLjcxLDYxLjkzYTgxLjg5LDgxLjg5LDAsMCwwLDI5LjcxLDM1YzIyLjk0LDE1LjA2LDU0LjMxLDE3LjIsNzguMTQsMy42czM4LjA3LTQzLjEsMzItNjkuODZTMjA1LjQsMTU4LDE3OC4xMSwxNjAuODRjLTQuMTYuNDMtMTAuMTMsMC0xMC4yOC00LjIxLS4xMi0zLjI0LDMuNzctNC45NCw3LTUuNTIsMjcuNjgtNSw1Ny4zNCw5LjA5LDcyLjUzLDMyLjc3czE2LDU1LjQxLDMuNTYsODAuNjYtMzcsNDMuNjktNjQuMzYsNTAuMzVDMTQ5LjY4LDMyMy44NywxMTYuMzEsMzEzLjI1LDkxLjExLDI5MS43NFoiLz48L3N2Zz4=&logoWidth=16" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://intl.minimaxi.com" style="margin: 2px;">
<img alt="API" src="https://img.shields.io/badge/⚡_API-Platform-FF4040?style=flat-square&labelColor=2C3E50" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div align="center" style="line-height: 1;">
<a href="https://github.com/MiniMax-AI/MiniMax-01/blob/main/LICENSE" style="margin: 2px;">
<img alt="License" src="https://img.shields.io/badge/📜_License-Model_Agreement-FF4040?style=flat-square&labelColor=2C3E50" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>

# MiniMax-VL-01

## 1. Introduction
We are delighted to introduce our **MiniMax-VL-01** model. It adopts the "ViT-MLP-LLM" framework commonly used for multimodal large language models. The model is initialized and trained with three key parts: a 303-million-parameter Vision Transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and MiniMax-Text-01 as the base LLM.

MiniMax-VL-01 has a notable dynamic resolution feature. Each input image is resized to one of a pre-set grid of resolutions ranging from 336×336 to 2016×2016, while a 336×336 thumbnail is also kept. The resized image is split into non-overlapping patches of the same size; these patches and the thumbnail are encoded separately and then combined into a full image representation.
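
To make the resizing scheme concrete, here is a small illustrative sketch. It is not part of the repository: it mirrors the `select_best_resolution` helper shipped in `image_processor.py`, the helper name `best_grid_resolution` is ours, and for brevity it scans the full 6×6 grid of 336-multiples rather than the exact `image_grid_pinpoints` list in `config.json`.

```python
# Illustrative only: choose the grid resolution that best fits an image, then count
# the 336x336 patches it is cut into (plus the 336x336 thumbnail).
def best_grid_resolution(original_size, steps=(336, 672, 1008, 1344, 1680, 2016)):
    ow, oh = original_size
    best, best_eff, best_waste = None, 0, float("inf")
    for w in steps:
        for h in steps:
            scale = min(w / ow, h / oh)                      # keep aspect ratio
            eff = min(int(ow * scale) * int(oh * scale), ow * oh)
            waste = w * h - eff
            if eff > best_eff or (eff == best_eff and waste < best_waste):
                best, best_eff, best_waste = (w, h), eff, waste
    return best

w, h = best_grid_resolution((1280, 960))   # e.g. a 1280x960 photo
print(w, h)                                # 1344 1008
print((w // 336) * (h // 336) + 1)         # 13 image crops, including the thumbnail
```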
The training data for MiniMax-VL-01 consists of caption, description, and instruction data. The Vision Transformer (ViT) is trained from scratch on 694 million image-caption pairs. Across four distinct stages of the training pipeline, a total of 512 billion tokens are processed, leveraging this vast amount of data to endow the model with strong capabilities.

Finally, MiniMax-VL-01 reaches top-level performance on multimodal leaderboards, demonstrating its edge and dependability on complex multimodal tasks.

<p align="center">
<img width="100%" src="figures/VisionBench.png">
</p>

## 2. Evaluation

| Tasks | GPT-4o<br>(11-20) | Claude-3.5-Sonnet (10-22) | Gemini-1.5-Pro (002) | Gemini-2.0-Flash (exp) | Qwen2-VL-72B-Inst. | InternVL2.5-78B | Llama-3.2-90B | MiniMax-VL-01 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| **Knowledge** | | | | | | | | |
| MMMU<sup>*</sup> | 63.5 | **72.0** | 68.4 | 70.6 | 64.5 | 66.5 | 62.1 | 68.5 |
| MMMU-Pro<sup>*</sup> | 54.5 | 54.7 | 50.9 | **57.0** | 43.2 | 47.3 | 36.0 | 52.7 |
| **Visual Q&A** | | | | | | | | |
| ChartQA<sup>*</sup><sub>relaxed</sub> | 88.1 | 90.8 | 88.7 | 88.3 | 91.2 | 91.5 | 85.5 | **91.7** |
| DocVQA<sup>*</sup> | 91.1 | 94.2 | 91.5 | 92.9 | **97.1** | 96.1 | 90.1 | 96.4 |
| OCRBench | 806 | 790 | 800 | 846 | 856 | 847 | 805 | **865** |
| **Mathematics & Sciences** | | | | | | | | |
| AI2D<sup>*</sup> | 83.1 | 82.0 | 80.9 | 85.1 | 84.4 | **86.8** | 78.9 | 83.3 |
| MathVista<sup>*</sup> | 62.1 | 65.4 | 70.6 | **73.1** | 69.6 | 68.4 | 57.3 | 68.6 |
| OlympiadBench<sub>full</sub> | 25.2 | 28.4 | 32.1 | **46.1** | 21.9 | 25.1 | 19.3 | 24.2 |
| **Long Context** | | | | | | | | |
| M-LongDoc<sub>acc</sub> | **41.4** | 31.4 | 26.2 | 31.4 | 11.6 | 19.7 | 13.9 | 32.5 |
| **Comprehensive** | | | | | | | | |
| MEGA-Bench<sub>macro</sub> | 49.4 | 51.4 | 45.9 | **53.9** | 46.8 | 45.3 | 19.9 | 47.4 |
| **User Experience** | | | | | | | | |
| In-house Benchmark | 62.3 | 47.0 | 49.2 | **72.1** | 40.6 | 34.8 | 13.6 | 56.6 |

<sup>*</sup> Evaluated following a _0-shot CoT_ setting.

## 3. Quickstart
Here we provide a simple example of loading the tokenizer and model to generate content.
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig, QuantoConfig, GenerationConfig
import torch
import json
import os
from PIL import Image

# load hf config
hf_config = AutoConfig.from_pretrained("MiniMax-VL-01", trust_remote_code=True)

# quantization config, int8 is recommended
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "vision_tower",
        "image_newline",
        "multi_modal_projector",
        "lm_head",
        "embed_tokens",
    ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.text_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.text_config.num_hidden_layers)]
)

# set device map
model_safetensors_index_path = os.path.join("MiniMax-VL-01", "model.safetensors.index.json")
with open(model_safetensors_index_path, "r") as f:
    model_safetensors_index = json.load(f)
weight_map = model_safetensors_index['weight_map']
vision_map = {}
for key, value in weight_map.items():
    if 'vision_tower' in key or 'image_newline' in key or 'multi_modal_projector' in key:
        new_key = key.replace('.weight', '').replace('.bias', '')
        if new_key not in vision_map:
            vision_map[new_key] = value
# assume 8 GPUs
world_size = 8
device_map = {
    'language_model.model.embed_tokens': 'cuda:0',
    'language_model.model.norm': f'cuda:{world_size - 1}',
    'language_model.lm_head': f'cuda:{world_size - 1}'
}
for key, value in vision_map.items():
    device_map[key] = 'cuda:0'
device_map['vision_tower.vision_model.post_layernorm'] = 'cuda:0'
layers_per_device = hf_config.text_config.num_hidden_layers // world_size
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'language_model.model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# load processor
processor = AutoProcessor.from_pretrained("MiniMax-VL-01", trust_remote_code=True)
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-VL-01 model."}]},
    {"role": "user", "content": [{"type": "image", "image": "placeholder"}, {"type": "text", "text": "Describe this image."}]},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
raw_image = Image.open("figures/image.jpg")
# tokenize and move to device
model_inputs = processor(images=[raw_image], text=prompt, return_tensors='pt').to('cuda').to(torch.bfloat16)

# load bfloat16 model, move to device, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMax-VL-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)
generation_config = GenerationConfig(
    max_new_tokens=100,
    eos_token_id=200020,
    use_cache=True,
)

# generate response
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

## 4. Chatbot & API
For general use and evaluation, we provide a [Chatbot](https://www.hailuo.ai/) with online search capabilities and the [online API](https://intl.minimaxi.com) for developers.

Contact us at [model@minimaxi.com](mailto:model@minimaxi.com).
added_tokens.json
ADDED
@@ -0,0 +1,28 @@
{
  "<code_interpreter>": 200023,
  "<commit_after>": 200018,
  "<commit_before>": 200016,
  "<commit_msg>": 200017,
  "<empty_output>": 200015,
  "<filename>": 200006,
  "<fim_middle>": 200002,
  "<fim_pad>": 200004,
  "<fim_prefix>": 200001,
  "<fim_suffix>": 200003,
  "<function_call>": 200022,
  "<gh_stars>": 200007,
  "<speech>[": 200024,
  "<image>[": 200025,
  "<issue_closed>": 200010,
  "<issue_comment>": 200009,
  "<issue_start>": 200008,
  "<jupyter_code>": 200013,
  "<jupyter_output>": 200014,
  "<jupyter_start>": 200011,
  "<jupyter_text>": 200012,
  "<reponame>": 200005,
  "[e~[": 200020,
  "]!d~[": 200021,
  "]!p~[": 200000,
  "]~b]": 200019
}
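
As a quick cross-reference (a sketch, not repository code): the `eos_token_id=200020` used in the README Quickstart's `GenerationConfig` is the id of the `[e~[` entry above, and id 200025 is the token referenced by `image_token_index` in `config.json`. The path is assumed to be relative to a local checkout.

```python
import json

# Look up the special-token ids declared in added_tokens.json.
with open("added_tokens.json") as f:
    added_tokens = json.load(f)

print(added_tokens["[e~["])  # 200020, the eos id used in the README's GenerationConfig
id_to_token = {v: k for k, v in added_tokens.items()}
print(id_to_token[200025])   # token string for config.json's image_token_index (200025)
```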
apple.jpg
ADDED
chat_template.json
ADDED
@@ -0,0 +1,3 @@
{
"chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{ '<beginning_of_sentence>system ai_setting=assistant\n' }}{% for item in message['content'] %}{% if item.type == 'image' %}<image>{% elif item.type == 'text' %}{{ item.text }}{% endif %}{% endfor %}{{ '<end_of_sentence>\n' }}{% endif %}{% if message['role'] == 'assistant' %}{{ '<beginning_of_sentence>ai name=assistant\n' }}{% for item in message['content'] %}{% if item.type == 'image' %}<image>{% elif item.type == 'text' %}{{ item.text }}{% endif %}{% endfor %}{{ '<end_of_sentence>\n' }}{% endif %}{% if message['role'] == 'user' %}{{ '<beginning_of_sentence>user name=user\n' }}{% for item in message['content'] %}{% if item.type == 'image' %}<image>{% elif item.type == 'text' %}{{ item.text }}{% endif %}{% endfor %}{{ '<end_of_sentence>\n' }}{% endif %}{% endfor %}{{ '<beginning_of_sentence>ai name=assistant\n' }}"
}
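
To see what this template produces, here is a small sketch (not repository code) that renders it directly with `jinja2`; the normal path is `processor.tokenizer.apply_chat_template`, as in the README Quickstart. The file path is an assumption (a local checkout).

```python
import json
from jinja2 import Template

# Render the shipped chat template for a single user turn; <image> marks where
# image features are spliced in by the processor.
with open("chat_template.json") as f:
    template = Template(json.load(f)["chat_template"])

messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]},
]
print(template.render(messages=messages))
# <beginning_of_sentence>user name=user
# <image>Describe this image.<end_of_sentence>
# <beginning_of_sentence>ai name=assistant
```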
config.json
ADDED
@@ -0,0 +1,261 @@
{
  "architectures": ["MiniMaxVL01ForConditionalGeneration"],
  "auto_map": {
    "AutoModelForCausalLM": "modeling_minimax_vl_01.MiniMaxVL01ForConditionalGeneration",
    "AutoConfig": "configuration_minimax_vl_01.MiniMaxVL01Config"
  },
  "ignore_index": -100,
  "image_grid_pinpoints": [
    [336, 336], [336, 672], [336, 1008], [336, 1344], [336, 1680], [336, 2016],
    [672, 336], [672, 672], [672, 1008], [672, 1344], [672, 1680], [672, 2016],
    [1008, 336], [1008, 672], [1008, 1008], [1008, 1344], [1008, 1680], [1008, 2016],
    [1344, 336], [1344, 672], [1344, 1008], [1344, 1344],
    [1680, 336], [1680, 672], [1680, 1008],
    [2016, 336], [2016, 672], [2016, 1008]
  ],
  "image_token_index": 200025,
  "model_type": "minimax_vl_01",
  "projector_hidden_act": "gelu",
  "text_config": {
    "architectures": ["MiniMaxText01ForCausalLM"],
    "attn_type_list": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    "bos_token_id": null,
    "eos_token_id": null,
    "head_dim": 128,
    "hidden_size": 6144,
    "intermediate_size": 9216,
    "layernorm_full_attention_alpha": 3.5565588200778455,
    "layernorm_full_attention_beta": 1.0,
    "layernorm_linear_attention_alpha": 3.5565588200778455,
    "layernorm_linear_attention_beta": 1.0,
    "layernorm_mlp_alpha": 3.5565588200778455,
    "layernorm_mlp_beta": 1.0,
    "max_position_embeddings": 8192,
    "model_type": "minimax_text_01",
    "num_attention_heads": 64,
    "num_experts_per_tok": 2,
    "num_hidden_layers": 80,
    "num_key_value_heads": 8,
    "num_local_experts": 32,
    "postnorm": true,
    "rms_norm_eps": 1e-05,
    "rope_theta": 10000000,
    "rotary_dim": 64,
    "shared_intermediate_size": [0],
    "shared_moe_mode": "sigmoid",
    "vocab_size": 200064
  },
  "transformers_version": "4.42.3",
  "vision_config": {
    "auto_map": {
      "AutoModel": "modeling_clip.CLIPVisionModel"
    },
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 6144,
    "vocab_size": 32000
  },
  "torch_dtype": "bfloat16",
  "vision_feature_layer": -1,
  "vision_feature_select_strategy": "default"
}
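
Two details worth pulling out of this config: `text_config.attn_type_list` encodes the hybrid attention layout of the 80 text layers, and `image_grid_pinpoints` lists the 28 candidate resolutions (all multiples of 336) used for dynamic-resolution input. The sketch below (not repository code) summarizes them; it assumes `0` marks a linear-attention layer and `1` a full softmax-attention layer, consistent with the every-eighth-layer pattern, and a path relative to a local checkout.

```python
import json
from collections import Counter

# Summarize the shipped config.json.
with open("config.json") as f:
    cfg = json.load(f)

text = cfg["text_config"]
print(Counter(text["attn_type_list"]))     # Counter({0: 70, 1: 10})
print(text["num_hidden_layers"])           # 80 layers in total
print(len(cfg["image_grid_pinpoints"]))    # 28 candidate resolutions
print(max(w * h for w, h in cfg["image_grid_pinpoints"]))  # 2032128 pixels (2016 x 1008) at most
```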
configuration_minimax_text_01.py
ADDED
@@ -0,0 +1,152 @@
""" MiniMaxText01 model configuration"""

from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging


logger = logging.get_logger(__name__)


class MiniMaxText01Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`MiniMaxText01Model`]. It is used to instantiate a
    MiniMaxText01 model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the MiniMaxText01.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the MiniMaxText01 model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`MiniMaxText01Model`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 14336):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_key_value_heads (`int`, *optional*, defaults to 8):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA); otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details checkout [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `8`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to `4096*32`):
            The maximum sequence length that this model might ever be used with. MiniMaxText01's sliding window attention
            allows sequences of up to 4096*32 tokens.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*):
            The id of the padding token.
        bos_token_id (`int`, *optional*, defaults to 1):
            The id of the "beginning-of-sequence" token.
        eos_token_id (`int`, *optional*, defaults to 2):
            The id of the "end-of-sequence" token.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether the model's input and output word embeddings should be tied.
        rope_theta (`float`, *optional*, defaults to 1000000.0):
            The base period of the RoPE embeddings.
        sliding_window (`int`, *optional*):
            Sliding window attention window size. If not specified, will default to `4096`.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        num_experts_per_tok (`int`, *optional*, defaults to 2):
            The number of experts to route per-token, can also be interpreted as the `top-k` routing
            parameter
        num_local_experts (`int`, *optional*, defaults to 8):
            Number of experts per Sparse MLP layer.
        output_router_logits (`bool`, *optional*, defaults to `False`):
            Whether or not the router logits should be returned by the model. Enabling this will also
            allow the model to output the auxiliary loss. See [here]() for more details
        router_aux_loss_coef (`float`, *optional*, defaults to 0.001):
            The aux loss factor for the total loss.
        router_jitter_noise (`float`, *optional*, defaults to 0.0):
            Amount of noise to add to the router.

    ```python
    >>> from transformers import MiniMaxText01Model, MiniMaxText01Config

    >>> # Initializing a MiniMaxText01 style configuration
    >>> configuration = MiniMaxText01Config()

    >>> # Initializing a model from the MiniMaxText01 style configuration
    >>> model = MiniMaxText01Model(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "MiniMaxText01"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=32000,
        hidden_size=4096,
        intermediate_size=14336,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=8,
        hidden_act="silu",
        max_position_embeddings=4096 * 32,
        initializer_range=0.02,
        rms_norm_eps=1e-5,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=None,
        eos_token_id=None,
        tie_word_embeddings=False,
        rope_theta=1e6,
        sliding_window=None,
        attention_dropout=0.0,
        num_experts_per_tok=2,
        num_local_experts=8,
        output_router_logits=False,
        router_aux_loss_coef=0.001,
        router_jitter_noise=0.0,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.sliding_window = sliding_window

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.attention_dropout = attention_dropout

        self.num_experts_per_tok = num_experts_per_tok
        self.num_local_experts = num_local_experts
        self.output_router_logits = output_router_logits
        self.router_aux_loss_coef = router_aux_loss_coef
        self.router_jitter_noise = router_jitter_noise
        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
configuration_minimax_vl_01.py
ADDED
@@ -0,0 +1,127 @@
"""MiniMaxVL01 model configuration"""

from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging
from transformers.models.auto import CONFIG_MAPPING, AutoConfig
from .configuration_minimax_text_01 import MiniMaxText01Config


class MiniMaxVL01Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`MiniMaxVL01ForConditionalGeneration`]. It is used to instantiate a
    MiniMaxVL01 model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the MiniMaxVL01.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `CLIPVisionConfig`):
            The config object or dictionary of the vision backbone.
        text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `MiniMaxText01Config`):
            The config object or dictionary of the text backbone.
        ignore_index (`int`, *optional*, defaults to -100):
            The ignore index for the loss function.
        image_token_index (`int`, *optional*, defaults to 32000):
            The image token index to encode the image prompt.
        projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The activation function used by the multimodal projector.
        vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
            The feature selection strategy used to select the vision feature from the vision backbone.
            Can be one of `"default"` or `"full"`. If `"default"`, the CLS token is removed from the vision features.
            If `"full"`, the full vision features are used.
        vision_feature_layer (`int`, *optional*, defaults to -2):
            The index of the layer to select the vision feature.
        image_grid_pinpoints (`List`, *optional*, defaults to `[[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]`):
            A list of possible resolutions to use for processing high resolution images. Each item in the list should be a tuple or list
            of the form `(height, width)`.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether the model's input and output word embeddings should be tied.
        image_seq_length (`int`, *optional*, defaults to 576):
            Sequence length of one image embedding.

    Example:

    ```python
    >>> from transformers import MiniMaxVL01ForConditionalGeneration, MiniMaxVL01Config, CLIPVisionConfig, MiniMaxText01Config

    >>> # Initializing a CLIP-vision config
    >>> vision_config = CLIPVisionConfig()

    >>> # Initializing a MiniMaxText01 config
    >>> text_config = MiniMaxText01Config()

    >>> # Initializing a MiniMaxVL01 style configuration
    >>> configuration = MiniMaxVL01Config(vision_config, text_config)

    >>> # Initializing a model from the MiniMaxVL01 style configuration
    >>> model = MiniMaxVL01ForConditionalGeneration(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "minimax_vl_01"

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        ignore_index=-100,
        image_token_index=32000,
        projector_hidden_act="gelu",
        vision_feature_select_strategy="default",
        vision_feature_layer=-2,
        image_grid_pinpoints=None,
        tie_word_embeddings=False,
        image_seq_length=576,
        **kwargs,
    ):
        self.ignore_index = ignore_index
        self.image_token_index = image_token_index
        self.projector_hidden_act = projector_hidden_act
        self.image_seq_length = image_seq_length

        if vision_feature_select_strategy not in ["default", "full"]:
            raise ValueError(
                "vision_feature_select_strategy should be one of 'default', 'full'."
                f"Got: {vision_feature_select_strategy}"
            )

        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer
        image_grid_pinpoints = (
            image_grid_pinpoints
            if image_grid_pinpoints is not None
            else [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]
        )
        self.image_grid_pinpoints = image_grid_pinpoints

        if isinstance(vision_config, dict):
            vision_config["model_type"] = (
                vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
            )
            vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
        elif vision_config is None:
            vision_config = CONFIG_MAPPING["clip_vision_model"](
                intermediate_size=4096,
                hidden_size=1024,
                patch_size=14,
                image_size=336,
                num_hidden_layers=24,
                num_attention_heads=16,
                vocab_size=32000,
                projection_dim=768,
            )

        self.vision_config = vision_config

        if text_config is not None:
            assert "model_type" in text_config, "text_config model_type is not specified"
            text_config = MiniMaxText01Config(**text_config)
        else:
            text_config = MiniMaxText01Config()

        self.text_config = text_config

        super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
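
Taken together with the `auto_map` entry in `config.json`, this class is what `AutoConfig` resolves to when the checkpoint is loaded with `trust_remote_code=True`, as in the README Quickstart. A minimal sketch, assuming `"MiniMax-VL-01"` is a local checkout of this repository:

```python
from transformers import AutoConfig

# AutoConfig follows config.json's auto_map to the custom configuration classes above.
cfg = AutoConfig.from_pretrained("MiniMax-VL-01", trust_remote_code=True)
print(type(cfg).__name__)                 # MiniMaxVL01Config
print(type(cfg.text_config).__name__)     # MiniMaxText01Config
print(cfg.text_config.num_hidden_layers)  # 80
print(cfg.vision_config.image_size)       # 336
print(cfg.image_token_index)              # 200025
```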
figures/MiniMaxLogo.png
ADDED
figures/TextBench.png
ADDED
figures/VisionBench.png
ADDED
figures/hailuo.svg
ADDED
figures/image.jpg
ADDED
figures/minimax.svg
ADDED
figures/niah.png
ADDED
image_processor.py
ADDED
@@ -0,0 +1,616 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
|
2 |
+
from typing import Optional, Union, Tuple, Dict, List, Iterable
|
3 |
+
from transformers.image_transforms import to_channel_dimension_format, PaddingMode
|
4 |
+
from transformers.image_utils import ChannelDimension, to_numpy_array, make_list_of_images, get_image_size, infer_channel_dimension_format
|
5 |
+
from transformers.utils import TensorType
|
6 |
+
from PIL import Image
|
7 |
+
import numpy as np
|
8 |
+
try:
|
9 |
+
from torchvision.transforms import InterpolationMode
|
10 |
+
BICUBIC = InterpolationMode.BICUBIC
|
11 |
+
except ImportError:
|
12 |
+
BICUBIC = Image.BICUBIC
|
13 |
+
|
14 |
+
import torch
|
15 |
+
from transformers.utils import (
|
16 |
+
TensorType,
|
17 |
+
is_torch_device,
|
18 |
+
is_torch_dtype,
|
19 |
+
requires_backends,
|
20 |
+
)
|
21 |
+
|
22 |
+
from torchvision.transforms import Compose, ToTensor, Normalize, ToPILImage, RandomResizedCrop, Resize
|
23 |
+
|
24 |
+
try:
|
25 |
+
from torchvision.transforms import InterpolationMode
|
26 |
+
BICUBIC = InterpolationMode.BICUBIC
|
27 |
+
except ImportError:
|
28 |
+
BICUBIC = Image.BICUBIC
|
29 |
+
|
30 |
+
from PIL import Image
|
31 |
+
import torch
|
32 |
+
import numpy as np
|
33 |
+
import os
|
34 |
+
processor_for_vllm = int(os.getenv("PROCESSOR_FOR_VLLM", 0))
|
35 |
+
|
36 |
+
def select_best_resolution(original_size, possible_resolutions):
|
37 |
+
"""
|
38 |
+
Selects the best resolution from a list of possible resolutions based on the original size.
|
39 |
+
|
40 |
+
Args:
|
41 |
+
original_size (tuple): The original size of the image in the format (width, height).
|
42 |
+
possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
|
43 |
+
|
44 |
+
Returns:
|
45 |
+
tuple: The best fit resolution in the format (width, height).
|
46 |
+
"""
|
47 |
+
original_width, original_height = original_size
|
48 |
+
best_fit = None
|
49 |
+
max_effective_resolution = 0
|
50 |
+
min_wasted_resolution = float("inf")
|
51 |
+
|
52 |
+
for width, height in possible_resolutions:
|
53 |
+
# Calculate the downscaled size to keep the aspect ratio
|
54 |
+
scale = min(width / original_width, height / original_height)
|
55 |
+
downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
|
56 |
+
|
57 |
+
# Calculate effective and wasted resolutions
|
58 |
+
effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
|
59 |
+
wasted_resolution = (width * height) - effective_resolution
|
60 |
+
|
61 |
+
if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution):
|
62 |
+
max_effective_resolution = effective_resolution
|
63 |
+
min_wasted_resolution = wasted_resolution
|
64 |
+
best_fit = (width, height)
|
65 |
+
|
66 |
+
return best_fit
|
67 |
+
|
68 |
+
def divide_to_patches(image, patch_size):
|
69 |
+
"""
|
70 |
+
Divides an image into patches of a specified size.
|
71 |
+
|
72 |
+
Args:
|
73 |
+
image (PIL.Image.Image): The input image.
|
74 |
+
patch_size (int): The size of each patch.
|
75 |
+
|
76 |
+
Returns:
|
77 |
+
list: A list of PIL.Image.Image objects representing the patches.
|
78 |
+
"""
|
79 |
+
patches = []
|
80 |
+
width, height = image.size
|
81 |
+
for i in range(0, height, patch_size):
|
82 |
+
for j in range(0, width, patch_size):
|
83 |
+
box = (j, i, j + patch_size, i + patch_size)
|
84 |
+
patch = image.crop(box)
|
85 |
+
patches.append(patch)
|
86 |
+
|
87 |
+
return patches
|
88 |
+
|
89 |
+
def image_size_to_num_patches(image_size, grid_pinpoints, patch_size):
|
90 |
+
if not isinstance(grid_pinpoints, list):
|
91 |
+
raise TypeError("grid_pinpoints should be a list of tuples or lists")
|
92 |
+
|
93 |
+
# ! VERY IMPORTANT if image_size is tensor, must convert to into tuple, otherwise it will cause wrong calculate
|
94 |
+
if not isinstance(image_size, (list, tuple)):
|
95 |
+
if not isinstance(image_size, (torch.Tensor, np.ndarray)):
|
96 |
+
raise TypeError(f"image_size invalid type {type(image_size)} with value {image_size}")
|
97 |
+
image_size = image_size.tolist()
|
98 |
+
|
99 |
+
best_resolution = select_best_resolution(image_size, grid_pinpoints)
|
100 |
+
width, height = best_resolution
|
101 |
+
num_patches = 0
|
102 |
+
# consider change to ceil(height/patch_size)*ceil(width/patch_size) + 1
|
103 |
+
for i in range(0, height, patch_size):
|
104 |
+
for j in range(0, width, patch_size):
|
105 |
+
num_patches += 1
|
106 |
+
# add the base patch
|
107 |
+
num_patches += 1
|
108 |
+
return num_patches
|
109 |
+
|
110 |
+
def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
|
111 |
+
"""
|
112 |
+
Calculate the shape of the image patch grid after the preprocessing for images of any resolution.
|
113 |
+
|
114 |
+
Args:
|
115 |
+
image_size (`tuple`):
|
116 |
+
The size of the input image in the format (width, height).
|
117 |
+
grid_pinpoints (`List`):
|
118 |
+
A list containing possible resolutions. Each item in the list should be a tuple or list
|
119 |
+
of the form `(height, width)`.
|
120 |
+
patch_size (`int`):
|
121 |
+
The size of each image patch.
|
122 |
+
|
123 |
+
Returns:
|
124 |
+
tuple: The shape of the image patch grid in the format (width, height).
|
125 |
+
"""
|
126 |
+
if not isinstance(grid_pinpoints, list):
|
127 |
+
raise TypeError("grid_pinpoints should be a list of tuples or lists")
|
128 |
+
|
129 |
+
# ! VERY IMPORTANT if image_size is tensor, must convert to into tuple, otherwise it will cause wrong calculate
|
130 |
+
if not isinstance(image_size, (list, tuple)):
|
131 |
+
if not isinstance(image_size, (torch.Tensor, np.ndarray)):
|
132 |
+
raise TypeError(
|
133 |
+
f"image_size invalid type: {type(image_size)} not valid, should be either list, tuple, np.ndarray or tensor"
|
134 |
+
)
|
135 |
+
image_size = image_size.tolist()
|
136 |
+
|
137 |
+
width, height = select_best_resolution(image_size, grid_pinpoints)
|
138 |
+
return width // patch_size, height // patch_size
|
139 |
+
|
140 |
+
|
141 |
+
# custom transform
|
142 |
+
class KeeyRatioResize(object):
|
143 |
+
def __init__(self, size):
|
144 |
+
self.size = size
|
145 |
+
|
146 |
+
def __call__(self, image):
|
147 |
+
return keepratio_resize(image, self.size)
|
148 |
+
|
149 |
+
def keepratio_resize(image, size, return_scale=False):
|
150 |
+
# Resize the image to keep the ratio
|
151 |
+
w, h = image.size
|
152 |
+
resized_w, resized_h = size
|
153 |
+
if w / h > resized_w / resized_h:
|
154 |
+
# resize and pad to the right and left
|
155 |
+
new_h = int(resized_w*h/w)
|
156 |
+
resized_image = image.resize((resized_w, new_h), Image.BICUBIC)
|
157 |
+
|
158 |
+
image = Image.new('RGB', (resized_w, resized_h), (0, 0, 0))
|
159 |
+
pad_h = (resized_h - new_h) // 2
|
160 |
+
image.paste(resized_image, (0, pad_h))
|
161 |
+
scale = resized_w / w
|
162 |
+
#image.paste(resized_image, (0, 0))
|
163 |
+
else:
|
164 |
+
# resize and pad to the top and bottom
|
165 |
+
new_w = int(resized_h*w/h)
|
166 |
+
resized_image = image.resize((new_w, resized_h), Image.BICUBIC)
|
167 |
+
image = Image.new('RGB', (resized_w, resized_h), (0, 0, 0))
|
168 |
+
#image.paste(resized_image, (0, 0))
|
169 |
+
pad_w = (resized_w - new_w) // 2
|
170 |
+
image.paste(resized_image, (pad_w, 0))
|
171 |
+
scale = resized_h / h
|
172 |
+
if return_scale:
|
173 |
+
return image, scale
|
174 |
+
return image
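A quick sketch of the letterbox behaviour: the long side is fitted, the short side is centred on black padding, and the returned scale is the applied resize factor.
from PIL import Image

src = Image.new("RGB", (1000, 500))                          # wide input
boxed, scale = keepratio_resize(src, (336, 336), return_scale=True)
print(boxed.size, round(scale, 3))                           # (336, 336) 0.336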
|
175 |
+
|
176 |
+
def _convert_image_to_rgb(image):
|
177 |
+
return image.convert("RGB")
|
178 |
+
|
179 |
+
def _transform(img_h, img_w, image_mean=(0.48145466, 0.4578275, 0.40821073), image_std=(0.26862954, 0.26130258, 0.27577711)):
|
180 |
+
return Compose([
|
181 |
+
# ToPILImage(),
|
182 |
+
#RandomResizedCrop((img_h, img_w), scale=(0.5, 1.0), interpolation=BICUBIC),
|
183 |
+
#Resize((img_h, img_w), interpolation=BICUBIC),
|
184 |
+
_convert_image_to_rgb,
|
185 |
+
ToTensor(),
|
186 |
+
Normalize(image_mean, image_std),
|
187 |
+
])
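Note that with the resize transforms commented out, _transform only converts to RGB, turns the already-resized PIL image into a tensor and normalizes it with CLIP-style statistics; img_h/img_w are currently unused. For example:
from PIL import Image

pipeline = _transform(336, 336)
tensor = pipeline(Image.new("RGB", (336, 336), (124, 116, 104)))
print(tensor.shape)                                          # torch.Size([3, 336, 336])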
|
188 |
+
|
189 |
+
|
190 |
+
def get_hw_multiple_of(image_size, multiple, max_size=None):
|
191 |
+
w, h = image_size
|
192 |
+
new_w = w if w % multiple == 0 else w + (multiple - w % multiple)
|
193 |
+
new_h = h if h % multiple == 0 else h + (multiple - h % multiple)
|
194 |
+
if max_size is not None:
|
195 |
+
assert isinstance(max_size, (list, tuple)) and len(max_size) == 2
|
196 |
+
max_w, max_h = max_size
|
197 |
+
assert max_w % multiple == 0 and max_h % multiple == 0
|
198 |
+
if new_w > max_w or new_h > max_h:
|
199 |
+
# ratio = min(max_w / new_w, max_h / new_h)
|
200 |
+
# new_w = int(new_w * ratio)
|
201 |
+
# new_h = int(new_h * ratio)
|
202 |
+
new_w = min((new_w * max_w) // new_w, (new_w * max_h) // new_h)
|
203 |
+
new_h = min((new_h * max_w) // new_w, (new_h * max_h) // new_h)
|
204 |
+
|
205 |
+
new_w = new_w if new_w % multiple == 0 else new_w + (multiple - new_w % multiple)
|
206 |
+
new_h = new_h if new_h % multiple == 0 else new_h + (multiple - new_h % multiple)
|
207 |
+
assert new_w % multiple == 0 and new_h % multiple == 0
|
208 |
+
assert new_w <= max_w and new_h <= max_h
|
209 |
+
return new_w, new_h
|
210 |
+
|
211 |
+
def resize_multiple_of(image, multiple, max_size=None):
|
212 |
+
"""
|
213 |
+
Resize the image so that its width and height are multiples of a number.
|
214 |
+
|
215 |
+
Args:
|
216 |
+
image (PIL.Image.Image): The input image.
|
217 |
+
multiple (int): The value that the resized width and height must be multiples of.
|
218 |
+
|
219 |
+
Returns:
|
220 |
+
PIL.Image.Image: The resized image.
|
221 |
+
"""
|
222 |
+
width, height = image.size
|
223 |
+
new_width, new_height = get_hw_multiple_of((width, height), multiple, max_size)
|
224 |
+
return image.resize((new_width, new_height), Image.BICUBIC)
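For example, rounding a 500x333 image up to multiples of a ViT patch size of 14:
from PIL import Image

print(get_hw_multiple_of((500, 333), 14))                           # (504, 336)
print(resize_multiple_of(Image.new("RGB", (500, 333)), 14).size)    # (504, 336)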
|
225 |
+
|
226 |
+
|
227 |
+
|
228 |
+
class CustomBatchFeature(BatchFeature):
|
229 |
+
def convert_to_tensors(self, tensor_type: Optional[Union[str, TensorType]] = None):
|
230 |
+
"""
|
231 |
+
Convert the inner content to tensors.
|
232 |
+
|
233 |
+
Args:
|
234 |
+
tensor_type (`str` or [`~utils.TensorType`], *optional*):
|
235 |
+
The type of tensors to use. If `str`, should be one of the values of the enum [`~utils.TensorType`]. If
|
236 |
+
`None`, no modification is done.
|
237 |
+
"""
|
238 |
+
if tensor_type is None:
|
239 |
+
return self
|
240 |
+
|
241 |
+
is_tensor, as_tensor = self._get_is_as_tensor_fns(tensor_type)
|
242 |
+
|
243 |
+
# Do the tensor conversion in batch
|
244 |
+
for key, value in self.items():
|
245 |
+
if key == "pixel_values":
|
246 |
+
for i, image in enumerate(value):
|
247 |
+
if not is_tensor(image):
|
248 |
+
tensor = as_tensor(image)
|
249 |
+
self[key][i] = tensor
|
250 |
+
continue
|
251 |
+
try:
|
252 |
+
if not is_tensor(value):
|
253 |
+
tensor = as_tensor(value)
|
254 |
+
|
255 |
+
self[key] = tensor
|
256 |
+
except: # noqa E722
|
257 |
+
if key == "overflowing_values":
|
258 |
+
raise ValueError("Unable to create tensor returning overflowing values of different lengths. ")
|
259 |
+
raise ValueError(
|
260 |
+
"Unable to create tensor, you should probably activate padding "
|
261 |
+
"with 'padding=True' to have batched tensors with the same length."
|
262 |
+
)
|
263 |
+
|
264 |
+
return self
|
265 |
+
|
266 |
+
def to(self, *args, **kwargs) -> "BatchFeature":
|
267 |
+
"""
|
268 |
+
Send all values to device by calling `v.to(*args, **kwargs)` (PyTorch only). This should support casting in
|
269 |
+
different `dtypes` and sending the `BatchFeature` to a different `device`.
|
270 |
+
|
271 |
+
Args:
|
272 |
+
args (`Tuple`):
|
273 |
+
Will be passed to the `to(...)` function of the tensors.
|
274 |
+
kwargs (`Dict`, *optional*):
|
275 |
+
Will be passed to the `to(...)` function of the tensors.
|
276 |
+
|
277 |
+
Returns:
|
278 |
+
[`BatchFeature`]: The same instance after modification.
|
279 |
+
"""
|
280 |
+
requires_backends(self, ["torch"])
|
281 |
+
import torch # noqa
|
282 |
+
|
283 |
+
new_data = {}
|
284 |
+
device = kwargs.get("device")
|
285 |
+
# Check if the args are a device or a dtype
|
286 |
+
if device is None and len(args) > 0:
|
287 |
+
# device should always be the first argument
|
288 |
+
arg = args[0]
|
289 |
+
if is_torch_dtype(arg):
|
290 |
+
# The first argument is a dtype
|
291 |
+
pass
|
292 |
+
elif isinstance(arg, str) or is_torch_device(arg) or isinstance(arg, int):
|
293 |
+
device = arg
|
294 |
+
else:
|
295 |
+
# it's something else
|
296 |
+
raise ValueError(f"Attempting to cast a BatchFeature to type {str(arg)}. This is not supported.")
|
297 |
+
# We cast only floating point tensors to avoid issues with tokenizers casting `LongTensor` to `FloatTensor`
|
298 |
+
for k, v in self.items():
|
299 |
+
if k == "pixel_values":
|
300 |
+
new_data[k] = [v[i].to(*args, **kwargs) for i in range(len(v))]
|
301 |
+
continue
|
302 |
+
# check if v is a floating point
|
303 |
+
if torch.is_floating_point(v):
|
304 |
+
# cast and send to device
|
305 |
+
new_data[k] = v.to(*args, **kwargs)
|
306 |
+
elif device is not None:
|
307 |
+
new_data[k] = v.to(device=device)
|
308 |
+
else:
|
309 |
+
new_data[k] = v
|
310 |
+
self.data = new_data
|
311 |
+
return self
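Because images can have different shapes, `pixel_values` is kept as a Python list of per-image tensors rather than being stacked, which is why both methods above loop over it element by element. A hedged sketch:
import numpy as np

feat = CustomBatchFeature(
    data={
        "pixel_values": [np.zeros((3, 336, 504), dtype=np.float32),
                         np.zeros((3, 504, 336), dtype=np.float32)],
        "image_sizes": [[333, 500], [500, 333]],
    },
    tensor_type="pt",
)
print(type(feat["pixel_values"]), feat["pixel_values"][0].shape)   # <class 'list'> torch.Size([3, 336, 504])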
|
312 |
+
|
313 |
+
|
314 |
+
def as_tensor(value):
|
315 |
+
if isinstance(value, (list, tuple)) and len(value) > 0:
|
316 |
+
if isinstance(value[0], np.ndarray):
|
317 |
+
value = np.array(value)
|
318 |
+
elif (
|
319 |
+
isinstance(value[0], (list, tuple))
|
320 |
+
and len(value[0]) > 0
|
321 |
+
and isinstance(value[0][0], np.ndarray)
|
322 |
+
):
|
323 |
+
value = np.array(value)
|
324 |
+
if isinstance(value, np.ndarray):
|
325 |
+
return torch.from_numpy(value)
|
326 |
+
else:
|
327 |
+
return torch.tensor(value)
|
328 |
+
|
329 |
+
class ImageProcessor(BaseImageProcessor):
|
330 |
+
model_input_names = ["pixel_values"]
|
331 |
+
|
332 |
+
def __init__(
|
333 |
+
self,
|
334 |
+
size: Optional[Union[int, Tuple[int, int], Dict[str, int]]] = None,
|
335 |
+
image_mean: Optional[Union[float, List[float]]] = None,
|
336 |
+
image_std: Optional[Union[float, List[float]]] = None,
|
337 |
+
process_image_mode: Optional[str] = 'resize',
|
338 |
+
patch_size: Optional[int] = 14,
|
339 |
+
image_grid_pinpoints: List = None,
|
340 |
+
**kwargs,
|
341 |
+
) -> None:
|
342 |
+
super().__init__(**kwargs)
|
343 |
+
self.size = size # (width, height)
|
344 |
+
self.image_mean = image_mean
|
345 |
+
self.image_std = image_std
|
346 |
+
self.process_image_mode = process_image_mode
|
347 |
+
image_grid_pinpoints = (
|
348 |
+
image_grid_pinpoints
|
349 |
+
if image_grid_pinpoints is not None
|
350 |
+
else [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]
|
351 |
+
)
|
352 |
+
self.image_grid_pinpoints = image_grid_pinpoints
|
353 |
+
self.patch_size = patch_size
|
354 |
+
|
355 |
+
def preprocess(self,
|
356 |
+
images,
|
357 |
+
return_tensors: Optional[Union[str, TensorType]] = None,
|
358 |
+
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
|
359 |
+
input_data_format: Optional[Union[str, ChannelDimension]] = None,
|
360 |
+
**kwargs,
|
361 |
+
):
|
362 |
+
if self.process_image_mode == 'resize':
|
363 |
+
return self.resize_preprocess(images, return_tensors, data_format, input_data_format, **kwargs)
|
364 |
+
elif self.process_image_mode == 'anyres':
|
365 |
+
if processor_for_vllm == 1:
|
366 |
+
return self.anyres_for_vllm_preprocess(images, return_tensors, data_format, input_data_format, **kwargs)
|
367 |
+
return self.anyres_preprocess(images, return_tensors, data_format, input_data_format, **kwargs)
|
368 |
+
elif self.process_image_mode == 'keepratio_resize':
|
369 |
+
return self.keepratio_resize_preprocess(images, return_tensors, data_format, input_data_format, **kwargs)
|
370 |
+
elif self.process_image_mode == 'dynamic_res':
|
371 |
+
return self.dynamic_res_preprocess(images, return_tensors, data_format, input_data_format, **kwargs)
|
372 |
+
else:
|
373 |
+
raise ValueError(f"Invalid process_image_mode: {self.process_image_mode}")
|
374 |
+
|
375 |
+
def resize_preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs):
|
376 |
+
images = make_list_of_images(images)
|
377 |
+
all_images = []
|
378 |
+
for image in images:
|
379 |
+
resized_image = image.resize(self.size, Image.BICUBIC)
|
380 |
+
transform_img = _transform(self.size[1], self.size[0], self.image_mean, self.image_std)(resized_image)
|
381 |
+
all_images.append(to_numpy_array(transform_img))
|
382 |
+
|
383 |
+
images = [
|
384 |
+
to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
|
385 |
+
for image in all_images
|
386 |
+
]
|
387 |
+
|
388 |
+
data = {"pixel_values": images}
|
389 |
+
return CustomBatchFeature(data=data, tensor_type=return_tensors)
|
390 |
+
|
391 |
+
def keepratio_resize_preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs):
|
392 |
+
images = make_list_of_images(images)
|
393 |
+
all_images = []
|
394 |
+
for image in images:
|
395 |
+
resized_image = keepratio_resize(image, self.size)
|
396 |
+
transform_img = _transform(self.size[1], self.size[0], self.image_mean, self.image_std)(resized_image)
|
397 |
+
all_images.append(to_numpy_array(transform_img))
|
398 |
+
|
399 |
+
images = [
|
400 |
+
to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
|
401 |
+
for image in all_images
|
402 |
+
]
|
403 |
+
|
404 |
+
data = {"pixel_values": images}
|
405 |
+
return CustomBatchFeature(data=data, tensor_type=return_tensors)
|
406 |
+
|
407 |
+
def dynamic_res_preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs):
|
408 |
+
images = make_list_of_images(images)
|
409 |
+
all_images = []
|
410 |
+
image_sizes = []
|
411 |
+
for image in images:
|
412 |
+
ori_w, ori_h = image.size
|
413 |
+
image_sizes.append([ori_h, ori_w])
|
414 |
+
resized_image = resize_multiple_of(image, self.patch_size, max_size=self.size)
|
415 |
+
resized_w, resized_h = resized_image.size
|
416 |
+
transform_img = _transform(resized_h, resized_w, self.image_mean, self.image_std)(resized_image)
|
417 |
+
all_images.append(to_numpy_array(transform_img))
|
418 |
+
|
419 |
+
images = [
|
420 |
+
as_tensor(to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format))
|
421 |
+
for image in all_images
|
422 |
+
]
|
423 |
+
|
424 |
+
# data = {"pixel_values": images, "image_sizes": as_tensor(image_sizes)}
|
425 |
+
# return data
|
426 |
+
data = {"pixel_values": images, "image_sizes": image_sizes}
|
427 |
+
#return BatchFeature(data=data, data_format=data_format, tensor_type=return_tensors)
|
428 |
+
|
429 |
+
return CustomBatchFeature(data=data, tensor_type=return_tensors)
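In 'dynamic_res' mode each image keeps (a multiple-of-patch_size version of) its own resolution, so `pixel_values` is a list of differently shaped tensors and `image_sizes` records the original (height, width). An illustrative sketch (the 2016 max size is an assumption and must be divisible by patch_size):
from PIL import Image

proc = ImageProcessor(size=(2016, 2016), process_image_mode="dynamic_res", patch_size=14,
                      image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5))
out = proc.preprocess(Image.new("RGB", (500, 333)), return_tensors="pt")
print(out["pixel_values"][0].shape)                          # torch.Size([3, 336, 504])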
|
430 |
+
|
431 |
+
def get_image_patches(
|
432 |
+
self,
|
433 |
+
data: Image,
|
434 |
+
image_grid_pinpoints,
|
435 |
+
):
|
436 |
+
if not isinstance(image_grid_pinpoints, list):
|
437 |
+
raise TypeError("grid_pinpoints must be a list of possible resolutions.")
|
438 |
+
|
439 |
+
|
440 |
+
best_resolution = select_best_resolution(data.size, image_grid_pinpoints)
|
441 |
+
|
442 |
+
resized_data, scale = keepratio_resize(data, best_resolution, return_scale=True)
|
443 |
+
resized_data = divide_to_patches(resized_data, self.size[0])
|
444 |
+
ori_data = data.resize(self.size, Image.BICUBIC)
|
445 |
+
data = [ori_data] + resized_data
|
446 |
+
return data
|
447 |
+
|
448 |
+
def pad(
|
449 |
+
self,
|
450 |
+
image: np.ndarray,
|
451 |
+
padding: Union[int, Tuple[int, int], Iterable[Tuple[int, int]]],
|
452 |
+
mode: PaddingMode = PaddingMode.CONSTANT,
|
453 |
+
constant_values: Union[float, Iterable[float]] = 0.0,
|
454 |
+
data_format: Optional[Union[str, ChannelDimension]] = None,
|
455 |
+
input_data_format: Optional[Union[str, ChannelDimension]] = None,
|
456 |
+
) -> np.ndarray:
|
457 |
+
"""
|
458 |
+
Pads the `image` with the specified `padding` and `mode`. Padding can be in the (`height`, `width`)
|
459 |
+
dimension or in the (`num_patches`) dimension. In the second case an iterable of tuples is expected
|
460 |
+
as input.
|
461 |
+
|
462 |
+
Args:
|
463 |
+
image (`np.ndarray`):
|
464 |
+
The image to pad.
|
465 |
+
padding (`int` or `Tuple[int, int]` or `Iterable[Tuple[int, int]]`):
|
466 |
+
Padding to apply to the edges of the height, width axes. Can be one of three formats:
|
467 |
+
- `((before_height, after_height), (before_width, after_width))` unique pad widths for each axis.
|
468 |
+
- `((before, after),)` yields same before and after pad for height and width.
|
469 |
+
- `(pad,)` or int is a shortcut for before = after = pad width for all axes.
|
470 |
+
mode (`PaddingMode`):
|
471 |
+
The padding mode to use. Can be one of:
|
472 |
+
- `"constant"`: pads with a constant value.
|
473 |
+
- `"reflect"`: pads with the reflection of the vector mirrored on the first and last values of the
|
474 |
+
vector along each axis.
|
475 |
+
- `"replicate"`: pads with the replication of the last value on the edge of the array along each axis.
|
476 |
+
- `"symmetric"`: pads with the reflection of the vector mirrored along the edge of the array.
|
477 |
+
constant_values (`float` or `Iterable[float]`, *optional*):
|
478 |
+
The value to use for the padding if `mode` is `"constant"`.
|
479 |
+
data_format (`str` or `ChannelDimension`, *optional*):
|
480 |
+
The channel dimension format for the output image. Can be one of:
|
481 |
+
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
|
482 |
+
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
|
483 |
+
If unset, will use same as the input image.
|
484 |
+
input_data_format (`str` or `ChannelDimension`, *optional*):
|
485 |
+
The channel dimension format for the input image. Can be one of:
|
486 |
+
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
|
487 |
+
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
|
488 |
+
If unset, will use the inferred format of the input image.
|
489 |
+
|
490 |
+
Returns:
|
491 |
+
`np.ndarray`: The padded image.
|
492 |
+
|
493 |
+
"""
|
494 |
+
|
495 |
+
# call the general `pad` if padding on `height/width`, otherwise it's the `num_patches` dim
|
496 |
+
if isinstance(padding, int) or len(padding) != 4:
|
497 |
+
return pad(image, padding, mode, constant_values, data_format, input_data_format)
|
498 |
+
|
499 |
+
if input_data_format is None:
|
500 |
+
input_data_format = infer_channel_dimension_format(image)
|
501 |
+
if mode == PaddingMode.CONSTANT:
|
502 |
+
image = np.pad(image, padding, mode="constant", constant_values=constant_values)
|
503 |
+
elif mode == PaddingMode.REFLECT:
|
504 |
+
image = np.pad(image, padding, mode="reflect")
|
505 |
+
elif mode == PaddingMode.REPLICATE:
|
506 |
+
image = np.pad(image, padding, mode="edge")
|
507 |
+
elif mode == PaddingMode.SYMMETRIC:
|
508 |
+
image = np.pad(image, padding, mode="symmetric")
|
509 |
+
else:
|
510 |
+
raise ValueError(f"Invalid padding mode: {mode}")
|
511 |
+
image = (
|
512 |
+
to_channel_dimension_format(image, data_format, input_data_format) if data_format is not None else image
|
513 |
+
)
|
514 |
+
return image
|
515 |
+
|
516 |
+
def _pad_for_batching(
|
517 |
+
self,
|
518 |
+
pixel_values: List[np.ndarray],
|
519 |
+
data_format: Optional[Union[str, ChannelDimension]] = None,
|
520 |
+
input_data_format: Optional[Union[str, ChannelDimension]] = None,
|
521 |
+
):
|
522 |
+
"""
|
523 |
+
Pads images on the `num_patches` dimension with zeros to form a batch with the same number of patches.
|
524 |
+
|
525 |
+
Args:
|
526 |
+
pixel_values (`List[np.ndarray]`):
|
527 |
+
An array of pixel values for each image, of shape (`batch_size`, `num_patches`, `image_in_3D`)
|
528 |
+
data_format (`str` or `ChannelDimension`, *optional*):
|
529 |
+
The channel dimension format for the output image. Can be one of:
|
530 |
+
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
|
531 |
+
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
|
532 |
+
If unset, will use same as the input image.
|
533 |
+
input_data_format (`str` or `ChannelDimension`, *optional*):
|
534 |
+
The channel dimension format for the input image. Can be one of:
|
535 |
+
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
|
536 |
+
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
|
537 |
+
If unset, will use the inferred format of the input image.
|
538 |
+
|
539 |
+
Returns:
|
540 |
+
List[`np.ndarray`]: The padded images.
|
541 |
+
"""
|
542 |
+
max_patch = max(len(x) for x in pixel_values)
|
543 |
+
pixel_values = [
|
544 |
+
self.pad(
|
545 |
+
image,
|
546 |
+
padding=((0, max_patch - image.shape[0]), (0, 0), (0, 0), (0, 0)),
|
547 |
+
data_format=data_format,
|
548 |
+
input_data_format=input_data_format,
|
549 |
+
)
|
550 |
+
for image in pixel_values
|
551 |
+
]
|
552 |
+
|
553 |
+
return pixel_values
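A small sketch of the patch-dimension padding: shorter patch stacks are zero-padded up to the longest one in the batch (shapes are illustrative).
import numpy as np

proc = ImageProcessor(size=(336, 336))
batch = [np.zeros((3, 3, 336, 336), dtype=np.float32),
         np.zeros((5, 3, 336, 336), dtype=np.float32)]
padded = proc._pad_for_batching(batch)
print(padded[0].shape, padded[1].shape)                      # (5, 3, 336, 336) (5, 3, 336, 336)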
|
554 |
+
|
555 |
+
def anyres_for_vllm_preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, input_data_format: Optional[Union[str, ChannelDimension]] = None, do_pad: Optional[bool] = None, **kwargs):
|
556 |
+
|
557 |
+
images = make_list_of_images(images)
|
558 |
+
new_images = []
|
559 |
+
image_sizes = []
|
560 |
+
|
561 |
+
for image in images:
|
562 |
+
ori_w, ori_h = image.size
|
563 |
+
image_sizes.append([ori_h, ori_w])
|
564 |
+
image_patches = self.get_image_patches(
|
565 |
+
image,
|
566 |
+
self.image_grid_pinpoints
|
567 |
+
)
|
568 |
+
all_images = []
|
569 |
+
for image in image_patches:
|
570 |
+
transform_img = _transform(self.size[0], self.size[1], self.image_mean, self.image_std)(image)
|
571 |
+
img_array = to_numpy_array(transform_img)
|
572 |
+
img_array = to_channel_dimension_format(img_array, data_format, input_channel_dim=input_data_format)
|
573 |
+
all_images.append(img_array)
|
574 |
+
#new_images.append(img_array)
|
575 |
+
pixel_values = np.array(all_images)
|
576 |
+
new_images.append(pixel_values)
|
577 |
+
|
578 |
+
|
579 |
+
new_images = self._pad_for_batching(new_images)
|
580 |
+
|
581 |
+
data = {"pixel_values": new_images, "image_sizes": image_sizes}
|
582 |
+
return BatchFeature(data=data, tensor_type=return_tensors)
|
583 |
+
|
584 |
+
|
585 |
+
def anyres_preprocess(self, images, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, input_data_format: Optional[Union[str, ChannelDimension]] = None, do_pad: Optional[bool] = None, **kwargs):
|
586 |
+
|
587 |
+
images = make_list_of_images(images)
|
588 |
+
new_images = []
|
589 |
+
image_sizes = []
|
590 |
+
|
591 |
+
for image in images:
|
592 |
+
ori_w, ori_h = image.size
|
593 |
+
image_sizes.append([ori_h, ori_w])
|
594 |
+
image_patches = self.get_image_patches(
|
595 |
+
image,
|
596 |
+
self.image_grid_pinpoints
|
597 |
+
)
|
598 |
+
#all_images = []
|
599 |
+
for image in image_patches:
|
600 |
+
transform_img = _transform(self.size[0], self.size[1], self.image_mean, self.image_std)(image)
|
601 |
+
img_array = to_numpy_array(transform_img)
|
602 |
+
img_array = to_channel_dimension_format(img_array, data_format, input_channel_dim=input_data_format)
|
603 |
+
#all_images.append(img_array)
|
604 |
+
new_images.append(img_array)
|
605 |
+
#pixel_values = np.array(all_images)
|
606 |
+
#new_images.append(pixel_values)
|
607 |
+
|
608 |
+
# if do_pad:
|
609 |
+
# new_images = self._pad_for_batching(new_images)
|
610 |
+
|
611 |
+
data = {"pixel_values": new_images, "image_sizes": image_sizes}
|
612 |
+
return CustomBatchFeature(data=data, tensor_type=return_tensors)
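'anyres' mode flattens one base view plus the tiles of the best-fit grid resolution into `pixel_values`; a hedged sketch (the exact tile count depends on select_best_resolution and the configured pinpoints):
from PIL import Image

proc = ImageProcessor(size=(336, 336), process_image_mode="anyres",
                      image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5))
out = proc.anyres_preprocess([Image.new("RGB", (640, 320))], return_tensors="pt")
print(len(out["pixel_values"]), out["image_sizes"])          # base view + grid tiles, original (h, w) sizes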
|
613 |
+
|
614 |
+
|
615 |
+
|
616 |
+
|
main.py
ADDED
@@ -0,0 +1,124 @@
1 |
+
import transformers
|
2 |
+
from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig, QuantoConfig, GenerationConfig
|
3 |
+
import torch
|
4 |
+
import safetensors
|
5 |
+
import argparse
|
6 |
+
import os
|
7 |
+
import json
|
8 |
+
from PIL import Image
|
9 |
+
|
10 |
+
"""
|
11 |
+
usage:
|
12 |
+
export SAFETENSORS_FAST_GPU=1
|
13 |
+
python main.py --quant_type int8 --world_size 8 --model_id <model_path> --image_path <image_path>
|
14 |
+
"""
|
15 |
+
|
16 |
+
def generate_quanto_config(hf_config: AutoConfig, quant_type: str):
|
17 |
+
QUANT_TYPE_MAP = {
|
18 |
+
"default": None,
|
19 |
+
"int8": QuantoConfig(
|
20 |
+
weights="int8",
|
21 |
+
modules_to_not_convert=[
|
22 |
+
"vision_tower",
|
23 |
+
"image_newline",
|
24 |
+
"multi_modal_projector",
|
25 |
+
"lm_head",
|
26 |
+
"embed_tokens",
|
27 |
+
] + [f"model.layers.{i}.coefficient" for i in range(hf_config.text_config.num_hidden_layers)]
|
28 |
+
+ [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.text_config.num_hidden_layers)]
|
29 |
+
),
|
30 |
+
}
|
31 |
+
return QUANT_TYPE_MAP[quant_type]
|
32 |
+
|
33 |
+
def parse_args():
|
34 |
+
parser = argparse.ArgumentParser()
|
35 |
+
parser.add_argument("--quant_type", type=str, default="default", choices=["default", "int8"])
|
36 |
+
parser.add_argument("--model_id", type=str, required=True)
|
37 |
+
parser.add_argument("--world_size", type=int, required=True)
|
38 |
+
parser.add_argument("--image_path", type=str, required=True)
|
39 |
+
return parser.parse_args()
|
40 |
+
|
41 |
+
def check_params(args, hf_config: AutoConfig):
|
42 |
+
if args.quant_type == "int8":
|
43 |
+
assert args.world_size >= 8, "int8 weight-only quantization requires at least 8 GPUs"
|
44 |
+
|
45 |
+
assert hf_config.text_config.num_hidden_layers % args.world_size == 0, f"num_hidden_layers({hf_config.text_config.num_hidden_layers}) must be divisible by world_size({args.world_size})"
|
46 |
+
|
47 |
+
@torch.no_grad()
|
48 |
+
def main():
|
49 |
+
args = parse_args()
|
50 |
+
print("\n=============== Argument ===============")
|
51 |
+
for key in vars(args):
|
52 |
+
print(f"{key}: {vars(args)[key]}")
|
53 |
+
print("========================================")
|
54 |
+
|
55 |
+
model_id = args.model_id
|
56 |
+
|
57 |
+
hf_config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
|
58 |
+
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
|
59 |
+
quantization_config = generate_quanto_config(hf_config, args.quant_type)
|
60 |
+
|
61 |
+
check_params(args, hf_config)
|
62 |
+
|
63 |
+
model_safetensors_index_path = os.path.join(model_id, "model.safetensors.index.json")
|
64 |
+
with open(model_safetensors_index_path, "r") as f:
|
65 |
+
model_safetensors_index = json.load(f)
|
66 |
+
weight_map = model_safetensors_index['weight_map']
|
67 |
+
vision_map = {}
|
68 |
+
for key, value in weight_map.items():
|
69 |
+
if 'vision_tower' in key or 'image_newline' in key or 'multi_modal_projector' in key:
|
70 |
+
new_key = key.replace('.weight','').replace('.bias','')
|
71 |
+
if new_key not in vision_map:
|
72 |
+
vision_map[new_key] = value
|
73 |
+
device_map = {
|
74 |
+
'language_model.model.embed_tokens': 'cuda:0',
|
75 |
+
'language_model.model.norm': f'cuda:{args.world_size - 1}',
|
76 |
+
'language_model.lm_head': f'cuda:{args.world_size - 1}'
|
77 |
+
}
|
78 |
+
for key, value in vision_map.items():
|
79 |
+
device_map[key] = 'cuda:0'
|
80 |
+
device_map['vision_tower.vision_model.post_layernorm'] = 'cuda:0'
|
81 |
+
layers_per_device = hf_config.text_config.num_hidden_layers // args.world_size
|
82 |
+
for i in range(args.world_size):
|
83 |
+
for j in range(layers_per_device):
|
84 |
+
device_map[f'language_model.model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'
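For intuition: the decoder layers are split evenly across GPUs, num_hidden_layers // world_size per device. A standalone sketch with illustrative numbers (80 layers, 8 GPUs):
num_hidden_layers, world_size = 80, 8                        # illustrative values
layers_per_device = num_hidden_layers // world_size
demo_map = {f"language_model.model.layers.{i * layers_per_device + j}": f"cuda:{i}"
            for i in range(world_size) for j in range(layers_per_device)}
print(demo_map["language_model.model.layers.0"], demo_map["language_model.model.layers.79"])   # cuda:0 cuda:7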
|
85 |
+
|
86 |
+
messages = [
|
87 |
+
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by Minimax based on MiniMax-VL-01 model."}]},
|
88 |
+
{"role": "user", "content": [{"type": "image", "image": "placeholder"},{"type": "text", "text": "Describe this image."}]},
|
89 |
+
]
|
90 |
+
prompt = processor.tokenizer.apply_chat_template(
|
91 |
+
messages, tokenize=False, add_generation_prompt=True
|
92 |
+
)
|
93 |
+
print(f"prompt: \n{prompt}")
|
94 |
+
raw_image = Image.open(args.image_path)
|
95 |
+
model_inputs = processor(images=[raw_image], text=prompt, return_tensors='pt').to('cuda').to(torch.bfloat16)
|
96 |
+
|
97 |
+
quantized_model = AutoModelForCausalLM.from_pretrained(
|
98 |
+
model_id,
|
99 |
+
torch_dtype="bfloat16",
|
100 |
+
device_map=device_map,
|
101 |
+
quantization_config=quantization_config,
|
102 |
+
trust_remote_code=True,
|
103 |
+
offload_buffers=True,
|
104 |
+
)
|
105 |
+
generation_config = GenerationConfig(
|
106 |
+
max_new_tokens=100,
|
107 |
+
eos_token_id=200020,
|
108 |
+
use_cache=True,
|
109 |
+
)
|
110 |
+
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
|
111 |
+
print(f"generated_ids: {generated_ids}")
|
112 |
+
generated_ids = [
|
113 |
+
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
|
114 |
+
]
|
115 |
+
response = processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
|
116 |
+
print(response)
|
117 |
+
# The image depicts a single, whole apple with a rich, red color. The apple appears to be fresh, with a smooth, glossy skin that reflects light, indicating its juiciness. The surface of the apple is dotted with small, light-colored
|
118 |
+
|
119 |
+
def query_safetensors(path):
|
120 |
+
safetensor = safetensors.torch.load_file(path)
|
121 |
+
for key in safetensor.keys():
|
122 |
+
print(key, safetensor[key].shape)
|
123 |
+
if __name__ == "__main__":
|
124 |
+
main()
|
merges.txt
ADDED
The diff for this file is too large to render.
See raw diff
|
|
model-00000-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:0065c8f5ab64c9f87f7cd6e673142c55890dfa943222bec51233ef606ec416dd
|
3 |
+
size 4916773016
|
model-00001-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:ab8a53c8d1d0c760469af949114ba57ef6c1850fd2699cd37eeeb72a2ad3f8cb
|
3 |
+
size 2272091616
|
model-00002-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:ddc886f947bacf60eda905d906325a1002005d23f6689560480bb0f946bd9cb7
|
3 |
+
size 2103016504
|
model-00003-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:03cc4611b191b9c112131145d84d9152f3bf8baca3b5f29ba2e899aa53889bbe
|
3 |
+
size 2116402768
|
model-00004-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:49253fa9d4675599a7443cddd5430f89a624ac9f06ce2c159fd74d8a71eca884
|
3 |
+
size 2254811112
|
model-00005-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:ee9a12e22b07386ff6e31ee0796db67cb73133004595da86bcc516a151ef8c8c
|
3 |
+
size 2103016528
|
model-00006-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:9a661ac4289717b68bbf16373de12c2f36a29e78253df4214fae1044a0e8fe95
|
3 |
+
size 2116402792
|
model-00007-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:d9a92374ed8b5dc251bc75ce342b0e83eb18cfa246c1b4a4b6967825836fca33
|
3 |
+
size 2254811144
|
model-00008-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:dfc4852d13aa5cec59b418e88ff89b7b0075c18245d15a074cc94a5ed7aca719
|
3 |
+
size 2151680728
|
model-00009-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:c0a5af9d01837d143df71e1fe9a33b837b28032fda68479e7a436e96b9ec837f
|
3 |
+
size 2264927088
|
model-00010-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:8576c0f3eeabf6c38918ada90fcdb3684a67e419bd86634448eda8354a1b91d3
|
3 |
+
size 2151680744
|
model-00011-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:4e1e3422c3b97af229f07ca345dc3aa2592224ace624e465f559137cc958aa50
|
3 |
+
size 2264927080
|
model-00012-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:610667ce07672247bb4e352a6ff936fea8dc247c28915c40b85b8e2af19ffec9
|
3 |
+
size 2151680736
|
model-00013-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:5d8bee23d6512e88088ecff78ad357fb09934cddd51fefec4f3d5513987db9fe
|
3 |
+
size 2264927088
|
model-00014-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:a3c324b8e3d0dcda8fcf0614254848e3ae11277d1f4ee0dcdaa92fc4833ca7c2
|
3 |
+
size 2151680728
|
model-00015-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:a2e300183a0267021275fab80c84bd16e3e0a762cbaf5e495a9bf2c6a195df4d
|
3 |
+
size 2264927096
|
model-00016-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:91c6de20aade4b94e30964acb21f39b9426aafa6add5dfc08b2fbe0901532d6d
|
3 |
+
size 2151680736
|
model-00017-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:6b717ecbb07cf950011f3af35725229f9b5447a8fbbcd9dd0568897441320640
|
3 |
+
size 2264927080
|
model-00018-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:cc00a918371812b3d36cc1e9ee52bf74e958d91ad142c26b6bb05a811d89566a
|
3 |
+
size 2151680744
|
model-00019-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:4652c9182fad501ba64a17dddc8241949c4cf29e8c6ae9dbcd29858cbb7d559e
|
3 |
+
size 2264927080
|
model-00020-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:ef516243538d99ec52253e7941a8be10b74e80c603adaeafc787056963598c9d
|
3 |
+
size 2151680728
|
model-00021-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:52aa8727d12d4bedcea47ff4cc589b370853d1c297c315eee958b9f10995f550
|
3 |
+
size 2264927096
|
model-00022-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:e75881dbed288aa4837d2190c357f7a4d4f472eda6794d84eea9ec2354c7dfbf
|
3 |
+
size 2151680728
|
model-00023-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:84150d127a938f8a9a8c5d961eb9ff13a1223e31acff39ff41caebc9d380137b
|
3 |
+
size 2264927088
|
model-00024-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:47ebd8c212e66da57aca48ba1b610c7ac4626661b57e93ee10a8f135f04a0232
|
3 |
+
size 2151680736
|
model-00025-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:58be1e67f2e11b69e1b43a73927e6f3ed234198d27a3237f6ee74507845e667c
|
3 |
+
size 2264927080
|
model-00026-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:4a8da785dcce3e79ab2b3e30362fcfdde52e7275996e132bf12020486888a46e
|
3 |
+
size 2151680744
|
model-00027-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:f6f13cd963b5b337ae4f7742e925ad47071423bcb96d1838688ae630ee4504ae
|
3 |
+
size 2264927088
|
model-00028-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:d3828583b89238a35f225d21ff996cbc20e29f4ed20f36533933f52b080caa12
|
3 |
+
size 2151680728
|
model-00029-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:8234bcb63517f7844a01fa11de3c3fab625337d071ec5df837864578bb391b0d
|
3 |
+
size 2264927096
|
model-00030-of-00414.safetensors
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
version https://git-lfs.github.com/spec/v1
|
2 |
+
oid sha256:7b44dd72ff07dd79e381ecacb3d24c37b2c8dc1261fc731ab3f74e7c4cc32226
|
3 |
+
size 2151680728
|