---
license: mit
datasets:
- OpenGVLab/V2PE-Data
language:
- en
base_model:
- OpenGVLab/InternVL2-2B
new_version: OpenGVLab/V2PE
library_name: transformers
tags:
- V2PE
---

# V2PE

[\[⭐️Project Page\]](https://zzdhybthu.github.io/V2PE.github.io) [\[📜 ArXiv Paper\]](https://arxiv.org/abs/2412.09616) [\[📂 GitHub\]](https://github.com/OpenGVLab/V2PE) [\[📖 HF Datasets\]](https://huggingface.co/datasets/OpenGVLab/V2PE-Data)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/hLydYFXbs8--Th-tOcQIe.png)

## Introduction

Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents.

To address this issue, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens.
Our experiments demonstrate that V2PE enhances VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to finetune the open-source VLM InternVL2-2B. The finetuned model achieves strong performance on both standard and long-context multimodal tasks.
Notably, when the sequence length of the training dataset is increased to 256K tokens, the model can process multimodal sequences of up to 1M tokens, highlighting its potential for real-world long-context applications.
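
To make the idea of "smaller increments for visual tokens" concrete, the following is a minimal sketch, not the released implementation. It assumes that the first token sits at position 0, that each text token advances the position index by 1, and that each visual token advances it by a fractional step. The function name `v2pe_position_ids`, the parameter `delta`, and the value 0.25 are hypothetical and chosen only for illustration.

```python
def v2pe_position_ids(is_visual, delta=0.25):
    """Assign one position index per token (illustrative sketch).

    is_visual: sequence of booleans, True where the token is a visual token.
    delta: hypothetical increment for visual tokens, with 0 < delta <= 1.
    """
    if not 0 < delta <= 1:
        raise ValueError("delta must be in (0, 1]")

    positions = []
    previous = None
    for visual in is_visual:
        if previous is None:
            # Assumption: the first token always starts at position 0.
            current = 0.0
        else:
            # Text tokens step by 1; visual tokens step by the smaller delta.
            current = previous + (delta if visual else 1.0)
        positions.append(current)
        previous = current
    return positions


# Two text tokens, four visual tokens, then one text token.
mask = [False, False, True, True, True, True, False]
print(v2pe_position_ids(mask, delta=0.25))
# [0.0, 1.0, 1.25, 1.5, 1.75, 2.0, 3.0]
```

With `delta=1.0` the function reduces to ordinary consecutive position indices, while a smaller `delta` lets a long run of visual tokens occupy a much narrower range of positions.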

This repository contains the instruction-tuned V2PE-32K-InternVL-2B and V2PE-256K-InternVL-2B models, which have 1.8B activated parameters (3B in total) and are trained on [V2PE-Data](https://huggingface.co/datasets/OpenGVLab/V2PE-Data).
They are built upon [InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B). For more details, please refer to our [paper](https://arxiv.org/abs/2412.09616).

## Performance

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f23418180f35af53531a6/z9KD8fzQ-pkblVkpOKW7J.png)

**General MLLM Benchmarks**

| Model | #Param | ChartQA | DocVQA | AI2D | InfoVQA | SQA | POPE | MMMU<sub>val</sub> | MMBench<sub>EN</sub> | SEED<sub>I</sub> | Avg |
|--------------------------|--------|---------|--------|-------|---------|-------|-------|--------------------|---------------------|------------------|-------|
| InternVL2-2B | 2.0B | 71.7 | 86.9 | 74.1 | 58.9 | 94.1 | 85.2 | 36.3 | 73.4 | 70.9 | 72.4 |
| DeepSeek-VL-1.3B | 2.0B | 47.4 | - | 51.5 | - | 68.4 | 85.9 | 33.8 | 66.4 | 66.0 | - |
| Qwen2-VL-2B | 2.0B | 73.5 | 90.1 | 74.7 | 65.5 | - | - | 41.1 | 74.9 | - | - |
| Aquila-VL-2B | 2.2B | 32.0 | 85.0 | 75.1 | 58.3 | 95.1 | 83.1 | 46.9 | 79.0 | 73.9 | 69.8 |
| MiniCPM-V-2 | 2.8B | 55.6 | 71.9 | 62.9 | - | 80.7 | 86.3 | 38.2 | 64.1 | 67.1 | - |
| Vintern-3B-beta | 3.7B | 68.3 | - | 69.1 | - | 75.0 | 87.4 | 46.7 | 70.6 | 70.0 | - |
| Llama 3.2 11B | 11B | 83.4 | 88.4 | 91.1 | - | - | - | 50.7 | 68.0 | - | - |
| Qwen2-VL-72B | 73B | 88.3 | 96.5 | 88.1 | 84.5 | 91.2 | 87.2 | 64.5 | 86.9 | 77.9 | 85.0 |
| GPT-4o | - | 85.7 | 92.8 | 84.7 | - | 90.1 | 97.2 | 69.1 | 82.1 | 76.7 | - |
| **InternVL2-V2PE-32K** | 2.0B | **76.4** | **83.9** | **73.2** | **55.9** | **94.9** | **88.8** | **36.6** | **73.5** | **71.2** | **72.5** |

**Long-Context MLLM Benchmarks**

| Model | #Param | MM-NIAH/Image | MM-NIAH/Text | MM-NIAH/Avg | Milebench/T | Milebench/S | Milebench/NI | Milebench/Avg | VideoMME | MVBench |
|--------------------------|--------|---------------|--------------|-------------|--------------|--------------|---------------|--------------|------------|------------|
| InternVL2-2B | 2.0B | 23.0 | 18.9 | 21.0 | 58.2 | 54.5 | 37.0 | 49.9 | - | - |
| Phi-3-Vision | 2.7B | - | - | - | 46.9 | 50.0 | - | - | - | - |
| OmChat | 3.9B | - | - | - | 51.4 | 52.0 | - | - | 45.9 | 50.2 |
| LongLLaVA | 9B | - | - | - | 47.3 | 46.8 | - | - | 43.7 | 49.1 |
| LongLLaVA | 13B | - | - | - | 52.7 | 52.1 | - | - | 51.6 | 54.6 |
| VILA | 13B | 14.5 | 40.5 | 27.5 | - | - | - | - | - | - |
| Gemini-1.5 | - | 28.5 | 82.1 | 55.2 | 50.2 | 58.3 | 97.9 | **68.8** | **69.6** | - |
| GPT-4V | - | - | 84.1 | - | 45.6 | 58.9 | **99.4** | 68.0 | 59.9 | 43.5 |
| GPT-4o | - | - | - | - | 56.2 | **63.5** | - | - | 64.7 | - |
| Claude3-Opus | - | - | - | - | 37.4 | 48.1 | 85.3 | 56.9 | 59.7 | - |
| **InternVL2-V2PE-32K** | 2.0B | **78.1** | **85.7** | **81.8** | **65.5** | 56.4 | 97.2 | 72.5 | 50.7 | **65.6** |

## Usage

Please refer to our [GitHub Repo](https://github.com/OpenGVLab/V2PE).

## License

This project is released under the MIT License.

## Citation

If you find this work helpful in your research, please consider citing:

```bibtex
@misc{ge2024v2peimprovingmultimodallongcontext,
      title={V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding},
      author={Junqi Ge and Ziyi Chen and Jintao Lin and Jinguo Zhu and Xihui Liu and Jifeng Dai and Xizhou Zhu},
      year={2024},
      eprint={2412.09616},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.09616},
}
```