File size: 7,415 Bytes
1a08e09 cce75f9 1a08e09 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 |
---
license: mit
---
# Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models
<br>
<p align="center">
<img src="https://v1.ax1x.com/2024/08/13/7GXu34.png" width="300"/>
<p>
> [**Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models**](https://arxiv.org/abs/2408.02034)<br>
> Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai <br>
[![arXiv](https://img.shields.io/badge/Arxiv-2408.02034-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2408.02034)
[![Demo](https://img.shields.io/badge/Demo-blue)](http://vlrlab-monkey.xyz:7685)
[![Model Weight](https://img.shields.io/badge/Model_Weight-gray)](https://huggingface.co/mx262/MiniMokney)
-----
**Mini-Monkey** is a lightweight MLLM that incorporates a plug-and-play method called multi-scale adaptive cropping strategy (MSAC). Mini-Monkey adaptively generates multi-scale representations, allowing it to select non-segmented objects from various scales. To mitigate the computational overhead introduced by MSAC, we propose a Scale Compression Mechanism (SCM), which effectively compresses image tokens. Mini-Monkey achieves state-of-the-art performance among 2B-parameter MLLMs. It not only demonstrates leading performance on a variety of general multimodal understanding tasks but also shows consistent improvements in document understanding capabilities. On the OCRBench, Mini-Monkey achieves a score of 802, outperforming 8B-parameter state-of-the-art model InternVL2-8B. Besides, our model and training strategy are very efficient, which can be trained with only eight RTX 3090.
# TODO
- [x] Open source code, weight, and data
- [x] Support training using 3090 GPUs (24Gb video memory)
- [ ] Mini-Monkey with different LLMs
# Model Zoo
Mini-Monkey was trained using 8 3090 GPUs on a dataset
| Model | #param | MME | RWQA | AI2D | CCB | SEED | HallB | POPE | MathVista | DocVQA | ChartQA | InfoVQA$ | TextVQA | OCRBench |
|-------|---------|-----|------|------|-----|------|-------|------|-----------|-------------------|-------------------|-------------------|----------------|----------|
| Mini-Gemini | 35B | 2141.0 | - | - | - | - | - | - | 43.3 | - | - | - | - | - |
| LLaVA-NeXT | 35B | 2028.0 | - | 74.9 | 49.2 | 75.9 | 34.8 | 89.6 | 46.5 | - | - | - | - | - |
| InternVL 1.2 | 40B | 2175.4 | 67.5 | 79.0 | 59.2 | 75.6 | 47.6 | 88.0 | 47.7 | - | - | - | - | - |
| InternVL 1.5 | 26B | 2187.8 | 66.0 | 80.7 | 69.8 | 76.0 | 49.3 | 88.3 | 53.5 | 90.9 | 83.8 | 72.5 | 80.6 | 724 |
| DeepSeek-VL | 1.7B | 1531.6 | 49.7 | 51.5 | 37.6 | 43.7 | 27.6 | 85.9 | 29.4 | - | - | - | - | - |
| Mini-Gemini | 2.2B | 1653.0 | - | - | - | - | - | - | 29.4 | - | - | - | - | - |
| Bunny-StableLM-2 | 2B | 1602.9 | - | - | - | 58.8 | - | 85.9 | - | - | - | - | - | - |
| MiniCPM-V-2 | 2.8B | 1808.6 | 55.8 | 62.9 | 48.0 | - | 36.1 | 86.3 | 38.7 | 71.9 | 55.6 | - | 74.1 | 605 |
| InternVL 2 | 2B | 1876.8 | 57.3 | 74.1 | 74.7 | 70.9 | 37.9 | 85.2 | 46.3 | 86.9 | 76.2 | 58.9 | 73.4 | 784 |
| Mini-Monkey (ours) | 2B | 1881.9 | 57.5 | 74.7 | 75.5 | 71.3 | 38.7 | 86.7 | 47.3 | 87.4 | 76.5 | 60.1 | 75.7 | 802 |
## Environment
```python
conda create -n minimonkey python=3.10
conda activate minimonkey
git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey/project/mini_monkey
pip install -r requirements.txt
```
Install `flash-attn==2.3.6`:
```bash
pip install flash-attn==2.3.6 --no-build-isolation
```
Alternatively you can compile from source:
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.3.6
python setup.py install
```
## Evaluate
We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) repositories for model evaluation.
## Inference
We provide an example of inference code [here](https://github.com/Yuliang-Liu/Monkey/blob/main/project/mini_monkey/demo.py)
## Train
### Prepare Training Datasets
Inspired by InternVL 1.2, we adopted a [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with InternVL 1.2.
First, download the [annotation files](https://huggingface.co/OpenGVLab/InternVL/resolve/main/playground.zip) and place them in the `playground/opensource/` folder.
Second, download all the images we used.
- AI2D: [ai2d_images](https://drive.google.com/file/d/1dqqa3MnrxMXaU_K9JA6C83je32ibwdOY/view?usp=sharing) (provided by InternLM-XComposer)
- ChartQA: [ChartQA Dataset](https://huggingface.co/datasets/ahmed-masry/ChartQA/resolve/main/ChartQA%20Dataset.zip)
- COCO: [train2017](http://images.cocodataset.org/zips/train2017.zip)
- DocVQA: [train](https://datasets.cvc.uab.es/rrc/DocVQA/train.tar.gz), [val](https://datasets.cvc.uab.es/rrc/DocVQA/val.tar.gz), [test](https://datasets.cvc.uab.es/rrc/DocVQA/test.tar.gz)
- DVQA: [images](https://drive.google.com/file/d/1iKH2lTi1-QxtNUVRxTUWFvUvRHq6HAsZ/view)
- LLaVA-Pretrain: [images](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/resolve/main/images.zip)
- SynthDoG-EN: We only use 00000~00004 parquet files for now, with a total of 30K images. We provide the converted [images](https://huggingface.co/OpenGVLab/InternVL/resolve/main/synthdog-en-images.zip).
- GeoQA+: [GeoQA+](https://drive.google.com/file/d/1KL4_wIzr3p8XSKMkkLgYcYwCbb0TzZ9O/view) [images](https://huggingface.co/OpenGVLab/InternVL/resolve/main/geoqa%2B_images.zip)
Then, organize the data as follows in `playground/data`:
```none
playground/
βββ opensource
β βββ ai2d_train_12k.jsonl
β βββ chartqa_train_18k.jsonl
β βββ docvqa_train_10k.jsonl
β βββ dvqa_train_200k.jsonl
β βββ geoqa+.jsonl
β βββ llava_instruct_150k_zh.jsonl
β βββ synthdog_en.jsonl
βββ data
β βββ ai2d
β β βββ abc_images
β β βββ images
β βββ chartqa
β β βββ test
β β βββ train
β β βββ val
β βββ coco
β β βββ train2017
β βββ docvqa
β β βββ test
β β βββ train
β β βββ val
β βββ dvqa
β β βββ images
β βββ llava
β β βββ llava_pretrain
β β βββ images
β βββ synthdog-en
β β βββ images
β βββ geoqa+
β β βββ images
```
Execute the training code:
```python
sh shell/minimonkey/minimonkey_finetune_full.sh
```
## Citing Mini-Monkey
If you wish to refer to the baseline results published here, please use the following BibTeX entries:
```BibTeX
@article{huang2024mini,
title={Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models},
author={Huang, Mingxin and Liu, Yuliang and Liang, Dingkang and Jin, Lianwen and Bai, Xiang},
journal={arXiv preprint arXiv:2408.02034},
year={2024}
}
```
## Copyright
We welcome suggestions to help us improve the Mini-Monkey. For any query, please contact Dr. Yuliang Liu: ylliu@hust.edu.cn. If you find something interesting, please also feel free to share with us through email or open an issue. |