-----

<a href='https://michael4933.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='#'><img src='https://img.shields.io/badge/Demo-Page-purple'></a> <a href='https://arxiv.org/abs/2411.03628'><img src='https://img.shields.io/badge/Paper-PDF-orange'></a> <a href='https://huggingface.co/Michael4933/Migician'><img src='https://img.shields.io/badge/Model-Huggingface-red'></a> <a href='https://huggingface.co/datasets/Michael4933/MIG-Bench'><img src='https://img.shields.io/badge/Benchmark-Huggingface-yellow'></a> <a href='https://huggingface.co/datasets/Michael4933/MGrounding-630k'><img src='https://img.shields.io/badge/Dataset-Huggingface-blue'></a>
This repository hosts the usage details of our training dataset <strong>MGrounding-630k</strong> and benchmark <strong>MIG-Bench</strong>, together with the training implementation of Migician, the first competitive Multi-image Grounding MLLM capable of free-form grounding.
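
All three artifacts are hosted on the Hugging Face Hub (see the badges above). As a minimal sketch for fetching them locally, assuming only the standard `huggingface_hub` client — the variable names and local-path handling here are illustrative, not part of this repo:

```python
from huggingface_hub import snapshot_download

# Model weights for Migician (repo linked in the Model badge above).
model_path = snapshot_download(repo_id="Michael4933/Migician")

# The training data and benchmark live in dataset repos, hence repo_type="dataset".
data_path = snapshot_download(repo_id="Michael4933/MGrounding-630k", repo_type="dataset")
bench_path = snapshot_download(repo_id="Michael4933/MIG-Bench", repo_type="dataset")

print(model_path, data_path, bench_path)
```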
## 📝 Abstract
The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce 🎩<strong>Migician</strong>, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the [MGrounding-630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench), a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models.
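
For a first look at the training data referenced above, the sketch below uses the `datasets` library. Note the hedges: whether the repo loads directly via `load_dataset`, and what a record looks like, are assumptions — the authoritative format lives on the dataset card, not in this section.

```python
from datasets import load_dataset

# Attempt to load MGrounding-630k straight from the Hub.
# ASSUMPTION: the repo is loadable via load_dataset; if it instead ships raw
# JSON annotations and image archives, download it with
# huggingface_hub.snapshot_download and parse the annotation files manually.
ds = load_dataset("Michael4933/MGrounding-630k", split="train")

print(len(ds))   # expected on the order of 630k examples
print(ds[0])     # inspect one record; see the dataset card for the real schema
```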
## 😮 Top Multi-Image Grounding Capability