---
license: apache-2.0
language:
- en
pipeline_tag: visual-question-answering
tags:
- chat
---
# mPLUG-Owl3

## Introduction
mPLUG-Owl3 is a state-of-the-art multi-modal large language model designed to tackle the challenges of long image sequence understanding. We propose Hyper Attention, which boosts the speed of long visual sequence understanding in multimodal large language models by sixfold, allowing for processing of visual sequences that are eight times longer. Meanwhile, we maintain excellent performance on single-image, multi-image, and video tasks.

GitHub: mPLUG-Owl
## What's new

mPLUG-Owl3-7B-241101 is an improved version of mPLUG-Owl3-7B-240728.
### Fused Hyper Attention

Previously, mPLUG-Owl3 computed cross-attention and self-attention separately and fused the two outputs through an adaptive gate. We now use a unified operation that only requires computing attention once.
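The idea can be sketched schematically. The toy PyTorch snippet below is only an illustration, not the model's actual implementation; the tensor names and the sigmoid gate are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch=1, 8 text tokens, 16 visual tokens, head dim 64.
B, T, V, D = 1, 8, 16, 64
q = torch.randn(B, T, D)                                   # text queries
k_txt, v_txt = torch.randn(B, T, D), torch.randn(B, T, D)  # text keys/values
k_img, v_img = torch.randn(B, V, D), torch.randn(B, V, D)  # visual keys/values

# Old scheme (240728, schematic): two attention passes fused by an adaptive gate.
self_out  = F.scaled_dot_product_attention(q, k_txt, v_txt)
cross_out = F.scaled_dot_product_attention(q, k_img, v_img)
gate      = torch.sigmoid(torch.randn(B, T, 1))            # placeholder for the learned gate
fused_old = gate * cross_out + (1 - gate) * self_out

# New scheme (241101, schematic): a single attention call over the joint key/value set.
k = torch.cat([k_txt, k_img], dim=1)
v = torch.cat([v_txt, v_img], dim=1)
fused_new = F.scaled_dot_product_attention(q, k, v)
```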
### New template for media inputs

We now use the following format to represent split high-resolution images. In addition, image splitting can now be enabled when the input consists of multiple images to obtain further performance benefits, a combination the old version of mPLUG-Owl3 was not trained to handle.

```
<|start_cut|>2*3
<|image|> <|image|> <|image|>
<|image|> <|image|> <|image|>
<|image|><|end_cut|>
```

And we use the following format to represent videos.

```
<|start_video_frame|><|image|><|image|><|image|><|end_video_frame|>
```
### Adjusted media_offset

Previously, media_offset recorded the range of images each token could see. During training, images from multiple samples are concatenated along the batch dimension, so media_offset had to be modified carefully, otherwise it would point to the wrong image. To prevent this, media_offset is now a List[List[int]]: for each sample in the batch, it records the position of each of that sample's images within the original sequence. This design also makes the computation of the cross-attention mask and MI-Rope more efficient and convenient.
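As a purely illustrative example (the positions below are made up), a batch built from one sample with two images and one sample with three images would carry a media_offset shaped like this:

```python
# Hypothetical media_offset for a batch of two samples.
# Each inner list belongs to one sample and stores the position of each of
# that sample's images within its own (original, pre-concatenation) sequence.
media_offset = [
    [5, 42],       # sample 0: images at token positions 5 and 42
    [3, 17, 88],   # sample 1: images at token positions 3, 17 and 88
]
```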
All of these changes are well handled by the processor, and you don't need to change the original way of calling it.
### High performance on video and multi-image scenarios

| Model | NextQA | MVBench | VideoMME w/o sub | LongVideoBench-val | MLVU | LVBench |
|---|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 78.6 | 54.5 | 53.5 | 52.1 | 63.7 | - |
| mPLUG-Owl3-7B-241101 | 82.3 | 59.5 | 59.3 | 59.7 | 70.0 | 43.5 |

| Model | NLVR2 | Mantis-Eval | MathVerse-mv | SciVerse-mv | BLINK | Q-Bench2 |
|---|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 90.8 | 63.1 | 65.0 | 86.2 | 50.3 | 74.0 |
| mPLUG-Owl3-7B-241101 | 92.7 | 67.3 | 65.1 | 82.7 | 53.8 | 77.7 |

| Model | VQAv2 | OK-VQA | GQA | VizWizQA | TextVQA |
|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 82.1 | 60.1 | 65.0 | 63.5 | 69.0 |
| mPLUG-Owl3-7B-241101 | 83.2 | 61.4 | 64.7 | 62.9 | 71.4 |

| Model | MMB-EN | MMB-CN | MM-Vet | POPE | AI2D |
|---|---|---|---|---|---|
| mPLUG-Owl3-7B-240728 | 77.6 | 74.3 | 40.1 | 88.2 | 73.8 |
| mPLUG-Owl3-7B-241101 | 80.4 | 79.1 | 39.8 | 88.1 | 77.8 |
## Quickstart

Load mPLUG-Owl3. We currently only support `attn_implementation` in `['sdpa', 'flash_attention_2']`.
```python
import torch
from modelscope import AutoConfig, AutoModel

model_path = 'iic/mPLUG-Owl3-2B-241101'

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"
```
Chat with images.
```python
from PIL import Image
from modelscope import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

image = Image.new('RGB', (500, 500), color='red')

messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)

inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
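To continue the conversation, you can append the reply and a new user turn to messages and run the processor again. A minimal sketch that reuses the objects from the snippet above; it assumes g is a list of decoded strings (as suggested by decode_text=True), so adjust the indexing if your output differs.

```python
messages = [
    {"role": "user", "content": """<|image|>
Describe this image."""},
    {"role": "assistant", "content": g[0]},   # previous reply (assumes g is a list of strings)
    {"role": "user", "content": "What is the dominant color of the image?"},
    {"role": "assistant", "content": ""}
]

inputs = processor(messages, images=[image], videos=None)
inputs.to('cuda')
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})
print(model.generate(**inputs))
```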
Chat with a video.
```python
from PIL import Image
from modelscope import AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = model.init_processor(tokenizer)

messages = [
    {"role": "user", "content": """<|video|>
Describe this video."""},
    {"role": "assistant", "content": ""}
]

videos = ['/nas-mmu-data/examples/car_room.mp4']

MAX_NUM_FRAMES = 16

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_frames = [encode_video(_) for _ in videos]
inputs = processor(messages, images=None, videos=video_frames)

inputs.to(device)
inputs.update({
    'tokenizer': tokenizer,
    'max_new_tokens': 100,
    'decode_text': True,
})

g = model.generate(**inputs)
print(g)
```
### Save memory with Liger-Kernel

mPLUG-Owl3 is based on Qwen2, which can be optimized with Liger-Kernel to reduce memory usage.

```bash
pip install liger-kernel
```
```python
def apply_liger_kernel_to_mplug_owl3(
    rms_norm: bool = True,
    swiglu: bool = True,
    model = None,
) -> None:
    """
    Apply Liger kernels to replace original implementation in HuggingFace Qwen2 models

    Args:
        rms_norm (bool): Whether to apply Liger's RMSNorm. Default is True.
        swiglu (bool): Whether to apply Liger's SwiGLU MLP. Default is True.
        model (PreTrainedModel): The model instance to apply Liger kernels to, if the model has already been
            loaded. Default is None.
    """
    from liger_kernel.transformers.monkey_patch import _patch_rms_norm_module
    from liger_kernel.transformers.monkey_patch import _bind_method_to_module
    from liger_kernel.transformers.swiglu import LigerSwiGLUMLP

    base_model = model.language_model.model

    if rms_norm:
        _patch_rms_norm_module(base_model.norm)

    for decoder_layer in base_model.layers:
        if swiglu:
            _bind_method_to_module(
                decoder_layer.mlp, "forward", LigerSwiGLUMLP.forward
            )
        if rms_norm:
            _patch_rms_norm_module(decoder_layer.input_layernorm)
            _patch_rms_norm_module(decoder_layer.post_attention_layernorm)
    print("Applied Liger kernels to Qwen2 in mPLUG-Owl3")
```
```python
import torch
from modelscope import AutoConfig, AutoModel

model_path = 'iic/mPLUG-Owl3-2B-241101'

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval().cuda()
device = "cuda"

apply_liger_kernel_to_mplug_owl3(model=model)
```
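To check whether the patch actually reduces memory on your setup, you can compare peak GPU memory with and without it using PyTorch's memory statistics. A minimal sketch, assuming inputs was built by one of the chat examples above:

```python
import torch

torch.cuda.reset_peak_memory_stats()
g = model.generate(**inputs)   # one generation after applying the Liger kernels
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```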
### Save memory by setting device_map

When you have more than one GPU, you can set `device_map='auto'` to split mPLUG-Owl3 across multiple GPUs. However, this will slow down inference.
```python
model = AutoModel.from_pretrained(model_path, attn_implementation='flash_attention_2', device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
_ = model.eval()
first_layer_name = list(model.hf_device_map.keys())[0]
device = model.hf_device_map[first_layer_name]
```
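The device resolved above is the one hosting the model's first layers, so it can replace the hard-coded 'cuda' when preparing inputs in the chat examples, e.g.:

```python
inputs = processor(messages, images=[image], videos=None)
inputs.to(device)   # device of the first sharded layer, taken from hf_device_map
```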
## Citation

If you find our work helpful, feel free to cite our paper.
```bibtex
@misc{ye2024mplugowl3longimagesequenceunderstanding,
      title={mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models},
      author={Jiabo Ye and Haiyang Xu and Haowei Liu and Anwen Hu and Ming Yan and Qi Qian and Ji Zhang and Fei Huang and Jingren Zhou},
      year={2024},
      eprint={2408.04840},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04840},
}
```