opengvlab-admin committed • Commit 3ec3d5c • 1 Parent(s): f2f20b3
Update README.md
README.md CHANGED
```diff
@@ -21,16 +21,19 @@ pipeline_tag: visual-question-answering
 
 You can run multimodal large models using a 1080Ti now.
 
-We are delighted to introduce Mini-InternVL-Chat
+We are delighted to introduce the Mini-InternVL-Chat series. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) down to 300M and used [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b) or [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) as our language model. This resulted in a small multimodal model with excellent performance.
 
-As shown in the figure below, we adopted the same model architecture as InternVL 1.5. We simply replaced the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B. For training, we used the same data as InternVL 1.5 to train this smaller model. Additionally, due to the lower training costs of smaller models, we used a context length of 8K during training.
+As shown in the figure below, we adopted the same model architecture as InternVL 1.5. We simply replaced the original InternViT-6B with InternViT-300M and InternLM2-Chat-20B with InternLM2-Chat-1.8B / Phi-3-mini-128k-instruct. For training, we used the same data as InternVL 1.5 to train this smaller model. Additionally, due to the lower training costs of smaller models, we used a context length of 8K during training.
 
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/rDyoe66Sqev44T0wsP5Z7.png)
+
 ## Model Details
 - **Model Type:** multimodal large language model (MLLM)
 - **Model Stats:**
-  - Architecture: InternViT-300M-448px + MLP + [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b)
+  - Architecture: [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) + MLP + [InternLM2-Chat-1.8B](https://huggingface.co/internlm/internlm2-chat-1_8b)
   - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
   - Params: 2.2B
 
```
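To make the component swap concrete, here is a minimal sketch of the ViT-MLP-LLM composition the updated paragraph describes: a vision encoder produces patch tokens, a small MLP projects them into the language model's embedding space, and the projected tokens are fed to the LLM alongside the text embeddings. All class names and hidden sizes below are illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps vision features into the LLM embedding space.
    Dims are assumptions: InternViT-300M (~1024) -> InternLM2-1.8B (~2048)."""
    def __init__(self, vit_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vit_dim),
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_tokens)

class MiniInternVLSketch(nn.Module):
    """ViT + MLP + LLM, the same layout as InternVL 1.5, with the 6B ViT
    and 20B LLM swapped for the 300M / 1.8B components."""
    def __init__(self, vision_encoder: nn.Module,
                 projector: MLPProjector, language_model: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.language_model = language_model

    def forward(self, pixel_values: torch.Tensor,
                text_embeds: torch.Tensor):
        vision_tokens = self.vision_encoder(pixel_values)  # (B, N, vit_dim)
        vision_embeds = self.projector(vision_tokens)      # (B, N, llm_dim)
        # In the real model, visual embeddings are spliced into the text
        # sequence at image placeholder positions; concatenation stands in
        # for that here.
        fused = torch.cat([vision_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=fused)
```

The design point is that the MLP projector is the only glue between the two pretrained parts, so shrinking the model only changes the two endpoint modules and the projector dimensions.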
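The "dynamic resolution, max to 40 tiles" stat means an input image is resized to a grid that roughly preserves its aspect ratio and then cut into at most 40 tiles of 448 x 448 (about 8M pixels, i.e. roughly 4K). The sketch below is written from that description alone; the repository's actual preprocessing differs in details (e.g., it also weighs image area when choosing the grid and appends a thumbnail tile), so treat this as an approximation.

```python
from PIL import Image

TILE = 448
MAX_TILES = 40  # up to 40 tiles covers roughly 4K-resolution inputs

def pick_grid(width: int, height: int, max_tiles: int = MAX_TILES) -> tuple[int, int]:
    """Choose a (cols, rows) grid whose aspect ratio is closest to the
    image's, subject to cols * rows <= max_tiles."""
    target = width / height
    candidates = [(c, r) for c in range(1, max_tiles + 1)
                  for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    return min(candidates, key=lambda cr: abs(cr[0] / cr[1] - target))

def tile_image(img: Image.Image) -> list[Image.Image]:
    """Resize to the chosen grid, then split into 448x448 tiles."""
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    return [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
            for r in range(rows) for c in range(cols)]

# Hypothetical usage:
tiles = tile_image(Image.open("example.jpg").convert("RGB"))
print(len(tiles), "tiles")  # always <= 40
```

Each tile is then encoded by InternViT-300M independently, which is what lets a fixed-resolution 448px encoder handle arbitrarily shaped high-resolution inputs.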
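To back up the 1080Ti claim at the top of the section: a 2.2B-parameter model in float16 needs roughly 4.4 GB for weights, comfortably inside that card's 11 GB. A typical loading snippet might look like the following; the repo id `OpenGVLab/Mini-InternVL-Chat-2B-V1-5` is an assumption based on the model names above, so check the actual model card for the exact id and chat API.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed repo id; verify against the model card.
path = "OpenGVLab/Mini-InternVL-Chat-2B-V1-5"

# float16 weights for a 2.2B model ~= 4.4 GB, which fits an 11 GB 1080Ti.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.float16,  # 1080Ti has no bfloat16 support; use fp16
    trust_remote_code=True,     # InternVL ships custom modeling code
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```

From there, inference goes through the model's own chat-style interface defined in its remote code, with images preprocessed into 448px tiles as sketched above.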