license: mit
Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models
Mingxin Huang, Yuliang Liu, Dingkang Liang, Lianwen Jin, Xiang Bai
Mini-Monkey is a lightweight MLLM that incorporates a plug-and-play method called the multi-scale adaptive cropping strategy (MSAC). Mini-Monkey adaptively generates multi-scale representations, allowing it to select non-segmented objects from various scales. To mitigate the computational overhead introduced by MSAC, we propose a Scale Compression Mechanism (SCM), which effectively compresses image tokens. Mini-Monkey achieves state-of-the-art performance among 2B-parameter MLLMs. It not only demonstrates leading performance on a variety of general multimodal understanding tasks but also shows consistent improvements in document understanding. On OCRBench, Mini-Monkey achieves a score of 802, outperforming the 8B-parameter state-of-the-art model InternVL2-8B. In addition, our model and training strategy are efficient: the model can be trained with only eight RTX 3090 GPUs.
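To make the idea of cropping at more than one scale concrete, here is a minimal sketch of choosing a fine and a coarse tiling grid from the image aspect ratio. It is an illustration only and is not taken from the Mini-Monkey code: the tile-count ranges, the function names, and the rule that the coarse grid must not share a cut line with the fine grid are all assumptions made for this example.

```python
from fractions import Fraction


def candidate_grids(min_tiles, max_tiles):
    """All (cols, rows) grids whose tile count lies in [min_tiles, max_tiles]."""
    return [(c, r)
            for c in range(1, max_tiles + 1)
            for r in range(1, max_tiles + 1)
            if min_tiles <= c * r <= max_tiles]


def closest_grid(width, height, grids):
    """Grid whose cols/rows ratio is closest to the image aspect ratio.

    Ties are broken in favour of the grid with more tiles.
    """
    target = width / height
    return min(grids, key=lambda g: (abs(target - g[0] / g[1]), -g[0] * g[1]))


def cut_positions(n):
    """Relative positions of the interior cut lines when a side is split into n parts."""
    return {Fraction(i, n) for i in range(1, n)}


def shares_cut_line(fine, coarse):
    """True if the two grids cut the image along at least one common line."""
    same_vertical = cut_positions(fine[0]) & cut_positions(coarse[0])
    same_horizontal = cut_positions(fine[1]) & cut_positions(coarse[1])
    return bool(same_vertical or same_horizontal)


def multi_scale_grids(width, height, fine_range=(4, 9), coarse_range=(1, 3)):
    """Pick a fine grid, then a coarse grid that cuts along different lines.

    The default ranges are hypothetical values chosen for illustration.
    """
    fine = closest_grid(width, height, candidate_grids(*fine_range))
    coarse_options = [g for g in candidate_grids(*coarse_range)
                      if not shares_cut_line(fine, g)]
    # Fall back to a single tile if every coarse grid shares a cut line.
    coarse = closest_grid(width, height, coarse_options or [(1, 1)])
    return fine, coarse


if __name__ == "__main__":
    print(multi_scale_grids(1600, 800))
```

Because the two grids cut the image along different lines, an object split by a tile boundary at one scale stays whole in some tile at the other scale, which is the intuition behind selecting non-segmented objects from various scales.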
TODO
- Open source code, weights, and data
- Support training using 3090 GPUs (24 GB video memory)
- Mini-Monkey with different LLMs
Model Zoo
Mini-Monkey was trained using 8 3090 GPUs on a dataset
Model | #param | MME | RWQA | AI2D | CCB | SEED | HallB | POPE | MathVista | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mini-Gemini | 35B | 2141.0 | - | - | - | - | - | - | 43.3 | - | - | - | - | - |
LLaVA-NeXT | 35B | 2028.0 | - | 74.9 | 49.2 | 75.9 | 34.8 | 89.6 | 46.5 | - | - | - | - | - |
InternVL 1.2 | 40B | 2175.4 | 67.5 | 79.0 | 59.2 | 75.6 | 47.6 | 88.0 | 47.7 | - | - | - | - | - |
InternVL 1.5 | 26B | 2187.8 | 66.0 | 80.7 | 69.8 | 76.0 | 49.3 | 88.3 | 53.5 | 90.9 | 83.8 | 72.5 | 80.6 | 724 |
DeepSeek-VL | 1.7B | 1531.6 | 49.7 | 51.5 | 37.6 | 43.7 | 27.6 | 85.9 | 29.4 | - | - | - | - | - |
Mini-Gemini | 2.2B | 1653.0 | - | - | - | - | - | - | 29.4 | - | - | - | - | - |
Bunny-StableLM-2 | 2B | 1602.9 | - | - | - | 58.8 | - | 85.9 | - | - | - | - | - | - |
MiniCPM-V-2 | 2.8B | 1808.6 | 55.8 | 62.9 | 48.0 | - | 36.1 | 86.3 | 38.7 | 71.9 | 55.6 | - | 74.1 | 605 |
InternVL 2 | 2B | 1876.8 | 57.3 | 74.1 | 74.7 | 70.9 | 37.9 | 85.2 | 46.3 | 86.9 | 76.2 | 58.9 | 73.4 | 784 |
Mini-Monkey (ours) | 2B | 1881.9 | 57.5 | 74.7 | 75.5 | 71.3 | 38.7 | 86.7 | 47.3 | 87.4 | 76.5 | 60.1 | 75.7 | 802 |
Environment
conda create -n minimonkey python=3.10
conda activate minimonkey
git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey/project/mini_monkey
pip install -r requirements.txt
Install flash-attn==2.3.6:
pip install flash-attn==2.3.6 --no-build-isolation
Alternatively, you can compile from source:
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.3.6
python setup.py install
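After either installation route, it can be useful to confirm that the pinned version is the one actually visible to the environment. The following is a small sketch using only the standard library; the helper name `flash_attn_status` is hypothetical and is not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version


def flash_attn_status(expected="2.3.6", package="flash-attn"):
    """Return (status, installed_version), with status 'ok', 'mismatch', or 'missing'."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return "missing", None
    return ("ok" if installed == expected else "mismatch"), installed


if __name__ == "__main__":
    status, installed = flash_attn_status()
    print(f"flash-attn: {status} (installed: {installed}, expected: 2.3.6)")
```

This only checks the installed package metadata; it does not verify that the compiled CUDA kernels load on your GPU.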
Evaluate
We use the VLMEvalKit repository for model evaluation.
Inference
We provide an example of inference code here
Train
Prepare Training Datasets
Inspired by InternVL 1.2, we adopted LLaVA-ZH, DVQA, ChartQA, AI2D, DocVQA, GeoQA+, and SynthDoG-EN as training data. Most of the data remains consistent with InternVL 1.2.
First, download the annotation files and place them in the playground/opensource/ folder.
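Once the annotation files are in place, a quick sanity check can catch truncated downloads. The sketch below assumes only that each `.jsonl` file holds one JSON value per line; it makes no assumption about the fields inside each record, and the helper name `summarize_jsonl` is hypothetical.

```python
import json
from pathlib import Path


def summarize_jsonl(path):
    """Count parseable records and collect the line numbers that fail to parse."""
    records, bad_lines = 0, []
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
                records += 1
            except json.JSONDecodeError:
                bad_lines.append(lineno)
    return records, bad_lines


if __name__ == "__main__":
    # Reports every annotation file found; prints nothing if the folder is absent.
    for jsonl in sorted(Path("playground/opensource").glob("*.jsonl")):
        count, bad = summarize_jsonl(jsonl)
        print(f"{jsonl.name}: {count} records, {len(bad)} unparseable lines")
```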
Second, download all the images we used.
- AI2D: ai2d_images (provided by InternLM-XComposer)
- ChartQA: ChartQA Dataset
- COCO: train2017
- DocVQA: train, val, test
- DVQA: images
- LLaVA-Pretrain: images
- SynthDoG-EN: We only use 00000~00004 parquet files for now, with a total of 30K images. We provide the converted images.
- GeoQA+: GeoQA+ images
Then, organize the data as follows in playground/data:
playground/
├── opensource
│   ├── ai2d_train_12k.jsonl
│   ├── chartqa_train_18k.jsonl
│   ├── docvqa_train_10k.jsonl
│   ├── dvqa_train_200k.jsonl
│   ├── geoqa+.jsonl
│   ├── llava_instruct_150k_zh.jsonl
│   └── synthdog_en.jsonl
├── data
│   ├── ai2d
│   │   ├── abc_images
│   │   └── images
│   ├── chartqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── coco
│   │   └── train2017
│   ├── docvqa
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── dvqa
│   │   └── images
│   ├── llava
│   │   └── llava_pretrain
│   │       └── images
│   ├── synthdog-en
│   │   └── images
│   ├── geoqa+
│   │   └── images
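Before launching training, the layout above can be checked programmatically. This is a minimal sketch, not a script shipped with the repository: the `EXPECTED` list is transcribed from the directory tree, the helper name `missing_paths` is hypothetical, and the placement of `images` under `llava/llava_pretrain` is an assumption about the intended nesting.

```python
from pathlib import Path

# Paths relative to the playground/ root, transcribed from the tree above.
EXPECTED = [
    "opensource/ai2d_train_12k.jsonl",
    "opensource/chartqa_train_18k.jsonl",
    "opensource/docvqa_train_10k.jsonl",
    "opensource/dvqa_train_200k.jsonl",
    "opensource/geoqa+.jsonl",
    "opensource/llava_instruct_150k_zh.jsonl",
    "opensource/synthdog_en.jsonl",
    "data/ai2d/abc_images",
    "data/ai2d/images",
    "data/chartqa/test",
    "data/chartqa/train",
    "data/chartqa/val",
    "data/coco/train2017",
    "data/docvqa/test",
    "data/docvqa/train",
    "data/docvqa/val",
    "data/dvqa/images",
    "data/llava/llava_pretrain/images",  # nesting assumed, see note above
    "data/synthdog-en/images",
    "data/geoqa+/images",
]


def missing_paths(root="playground", expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing paths:")
        for rel in missing:
            print(f"  playground/{rel}")
    else:
        print("All expected files and folders are present.")
```

The check only confirms that each path exists; it does not verify that the image folders are complete.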
Execute the training code:
sh shell/minimonkey/minimonkey_finetune_full.sh
Citing Mini-Monkey
If you wish to refer to the baseline results published here, please use the following BibTeX entry:
@article{huang2024mini,
title={Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models},
author={Huang, Mingxin and Liu, Yuliang and Liang, Dingkang and Jin, Lianwen and Bai, Xiang},
journal={arXiv preprint arXiv:2408.02034},
year={2024}
}
Copyright
We welcome suggestions to help us improve Mini-Monkey. For any queries, please contact Dr. Yuliang Liu: ylliu@hust.edu.cn. If you find something interesting, please also feel free to share it with us by email or by opening an issue.