Safetensors · English · qwen2_vl
Michael4933 committed · Commit 65f69b2 · verified · 1 Parent(s): a1f6c6e

Update README.md

Files changed (1)
  1. README.md +2 -7
README.md CHANGED
@@ -22,12 +22,7 @@ base_model:
 
 -----
 
-<a href='https://michael4933.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
-<a href='#'><img src='https://img.shields.io/badge/Demo-Page-purple'></a>
-<a href='https://arxiv.org/abs/2411.03628'><img src='https://img.shields.io/badge/Paper-PDF-orange'></a>
-<a href='https://huggingface.co/Michael4933/Migician'><img src='https://img.shields.io/badge/Model-Huggingface-red'></a>
-<a href='https://huggingface.co/datasets/Michael4933/MIG-Bench'><img src='https://img.shields.io/badge/Benchmark-Huggingface-yellow'></a>
-<a href='https://huggingface.co/datasets/Michael4933/MGrounding-630k'><img src='https://img.shields.io/badge/Dataset-Huggingface-blue'></a>
+<a href='https://michael4933.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='#'><img src='https://img.shields.io/badge/Demo-Page-purple'></a> <a href='https://arxiv.org/abs/2411.03628'><img src='https://img.shields.io/badge/Paper-PDF-orange'></a> <a href='https://huggingface.co/Michael4933/Migician'><img src='https://img.shields.io/badge/Model-Huggingface-red'></a> <a href='https://huggingface.co/datasets/Michael4933/MIG-Bench'><img src='https://img.shields.io/badge/Benchmark-Huggingface-yellow'></a> <a href='https://huggingface.co/datasets/Michael4933/MGrounding-630k'><img src='https://img.shields.io/badge/Dataset-Huggingface-blue'></a>
 
 
 This repository hosts the usage details of our training dataset <strong>MGrounding-630k</strong> and benchmark <strong>MIG-Bench</strong> and the training implementation of Migician, the first competitive Multi-image Grounding MLLM capable of free-form grounding.
@@ -41,7 +36,7 @@ This repository hosts the usage details of our training dataset <strong>MGroundi
 
 ## 📝 Abstract
 
-The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce <strong>Migician</strong><img src="figs/logo.png" style="width: 3%">, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the [MGrounding-630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench), a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models.
+The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce 🎩<strong>Migician</strong>, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the [MGrounding-630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench), a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models.
 
 
 ## 😮 Top Multi-Image Grounding Capacity
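
The repository is tagged `qwen2_vl`, so the checkpoint should load through the standard Qwen2-VL classes in `transformers`. Below is a minimal sketch under that assumption; the image paths, prompt wording, and generation settings are illustrative placeholders, not the official Migician recipe.

```python
# Minimal sketch: loading Migician via the Qwen2-VL interface in `transformers`
# (the repo is tagged `qwen2_vl`). Prompt text and image paths are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Michael4933/Migician"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# A free-form multi-image grounding query over two local images.
images = [Image.open("scene_a.jpg"), Image.open("scene_b.jpg")]
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text",
         "text": "Find the object in image 2 that matches the one shown "
                 "in image 1 and give its bounding box."},
    ],
}]

# Render the chat template, preprocess text + images, and generate.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=images, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Since MGrounding-630k and MIG-Bench are hosted as Hugging Face dataset repos, `datasets.load_dataset("Michael4933/MIG-Bench")` should in principle fetch the benchmark, though the exact configs, splits, and field names need to be checked against the dataset cards.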