Safetensors · English · qwen2_vl
Michael4933 committed · Commit 65f69b2 · verified · 1 Parent(s): a1f6c6e

Update README.md

Files changed (1)
  1. README.md +2 -7
README.md CHANGED
@@ -22,12 +22,7 @@ base_model:
 
 -----
 
-<a href='https://michael4933.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
-<a href='#'><img src='https://img.shields.io/badge/Demo-Page-purple'></a>
-<a href='https://arxiv.org/abs/2411.03628'><img src='https://img.shields.io/badge/Paper-PDF-orange'></a>
-<a href='https://huggingface.co/Michael4933/Migician'><img src='https://img.shields.io/badge/Model-Huggingface-red'></a>
-<a href='https://huggingface.co/datasets/Michael4933/MIG-Bench'><img src='https://img.shields.io/badge/Benchmark-Huggingface-yellow'></a>
-<a href='https://huggingface.co/datasets/Michael4933/MGrounding-630k'><img src='https://img.shields.io/badge/Dataset-Huggingface-blue'></a>
+<a href='https://michael4933.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='#'><img src='https://img.shields.io/badge/Demo-Page-purple'></a> <a href='https://arxiv.org/abs/2411.03628'><img src='https://img.shields.io/badge/Paper-PDF-orange'></a> <a href='https://huggingface.co/Michael4933/Migician'><img src='https://img.shields.io/badge/Model-Huggingface-red'></a> <a href='https://huggingface.co/datasets/Michael4933/MIG-Bench'><img src='https://img.shields.io/badge/Benchmark-Huggingface-yellow'></a> <a href='https://huggingface.co/datasets/Michael4933/MGrounding-630k'><img src='https://img.shields.io/badge/Dataset-Huggingface-blue'></a>
 
 
 This repository hosts the usage details of our training dataset <strong>MGrounding-630k</strong> and benchmark <strong>MIG-Bench</strong> and the training implementation of Migician, the first competitive Multi-image Grounding MLLM capable of free-form grounding.
@@ -41,7 +36,7 @@ This repository hosts the usage details of our training dataset <strong>MGroundi
 
 ## 📝 Abstract
 
-The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce <strong>Migician</strong><img src="figs/logo.png" style="width: 3%">, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the [MGrounding-630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench), a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models.
+The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce 🎩<strong>Migician</strong>, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the [MGrounding-630k](https://huggingface.co/datasets/Michael4933/MGrounding-630k) dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose [MIG-Bench](https://huggingface.co/datasets/Michael4933/MIG-Bench), a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models.
 
 
 ## 😮 Top Multi-Image Grounding Capacity
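
The repository is tagged `qwen2_vl`, so the checkpoint should load through the standard Qwen2-VL classes in `transformers`. Below is a minimal sketch under that assumption; the image paths, prompt wording, and generation settings are illustrative placeholders, not the official Migician recipe.

```python
# Minimal sketch: loading Migician via the Qwen2-VL interface in `transformers`
# (the repo is tagged `qwen2_vl`). Prompt text and image paths are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Michael4933/Migician"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# A free-form multi-image grounding query over two local images.
images = [Image.open("scene_a.jpg"), Image.open("scene_b.jpg")]
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "image"},
        {"type": "text",
         "text": "Find the object in image 2 that matches the one shown "
                 "in image 1 and give its bounding box."},
    ],
}]

# Render the chat template, preprocess text + images, and generate.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=images, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Since MGrounding-630k and MIG-Bench are hosted as Hugging Face dataset repos, `datasets.load_dataset("Michael4933/MIG-Bench")` should in principle fetch the benchmark, though the exact configs, splits, and field names need to be checked against the dataset cards.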