---
license: cc-by-4.0
---

# Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions

[**Paper**](https://arxiv.org/abs/2402.11530) | [**Code**](https://github.com/BAAI-DCAI/Multimodal-Robustness-Benchmark) | [**Data**](https://huggingface.co/datasets/BAAI/Multimodal-Robustness-Benchmark)

## Overview

MMR provides a comprehensive suite for evaluating the understanding capabilities of Multimodal Large Language Models (MLLMs) and their robustness in handling negative (leading) questions about visual content they have already interpreted correctly. The MMR benchmark includes:

1. **Multimodal Robustness (MMR) Benchmark and Targeted Evaluation Metrics:**
   - Comprising 12 categories of paired positive and negative questions (a minimal paired-scoring sketch follows this list).
   - Each question is meticulously annotated by experts to ensure scientific validity and accuracy.

2. **Specially Designed Training Set:**
   - Contains paired positive and negative visual question-answer samples to enhance robustness.

3. **Combined Dataset and Models:**
   - The combined dataset merges the proposed training data with existing datasets.
   - Trained models include Bunny-MMR-3B, Bunny-MMR-4B, and Bunny-MMR-8B.
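
One natural way to score such pairs is to credit a model only when it answers both the positive question and its negative (leading) counterpart correctly, so that a correct positive answer cannot mask a failure on the misleading one. The sketch below illustrates this idea; it is an illustrative example rather than the benchmark's official evaluation code, and the record layout and field names (`positive`, `negative`, `prediction`, `answer`) are assumptions.

```python
# Illustrative paired scoring over a list of records. The schema below
# (keys "positive", "negative", "prediction", "answer") is assumed for the
# sake of the example and is not the benchmark's actual data format.

def paired_scores(records):
    pos_correct = neg_correct = pair_correct = 0
    for r in records:
        pos_ok = r["positive"]["prediction"] == r["positive"]["answer"]
        neg_ok = r["negative"]["prediction"] == r["negative"]["answer"]
        pos_correct += pos_ok
        neg_correct += neg_ok
        pair_correct += pos_ok and neg_ok
    n = len(records)
    return {
        "positive_accuracy": pos_correct / n,  # does the model understand the image?
        "negative_accuracy": neg_correct / n,  # does it resist the leading question?
        "paired_accuracy": pair_correct / n,   # robust only if both are answered correctly
    }
```

Under this scheme, a model that answers every positive question but agrees with every leading negative question would score 1.0, 0.0, and 0.0 respectively.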

In this repository, we provide Bunny-MMR-8B, which is built upon [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) and [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). More details about the model can be found on [GitHub](https://github.com/BAAI-DCAI/Multimodal-Robustness-Benchmark).

## Key Features

- **Rigorous Testing:**
  - Extensive testing on leading MLLMs shows that while these models can correctly interpret visual content, they exhibit significant vulnerabilities when faced with leading questions.

- **Enhanced Robustness:**
  - The targeted training significantly improves the MLLMs' ability to handle negative questions effectively.

## Quickstart

Below is a code snippet showing how to use the model with the `transformers` library.

Before running the snippet, install the following dependencies:

```shell
pip install torch transformers accelerate pillow
```

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set device
torch.set_default_device('cpu')  # or 'cuda'

# create model
model = AutoModelForCausalLM.from_pretrained(
    'AI4VR/Bunny-MMR-8B',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'AI4VR/Bunny-MMR-8B',
    trust_remote_code=True)

# text prompt
prompt = 'text prompt'  # replace with your question
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
# -200 marks where the image features are inserted; the duplicate BOS token of the second chunk is dropped
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(model.device)

# image, sample images can be found in the images folder of the GitHub repository
image = Image.open('path/to/image')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=model.device)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
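
To probe the robustness behavior that MMR targets, the snippet above can be wrapped into a small helper and called twice on the same image: once with a neutral (positive) question and once with a leading (negative) counterpart. The sketch below reuses the `model` and `tokenizer` loaded in the quickstart; the helper name, image path, and example questions are placeholders, not part of the benchmark.

```python
def ask(image_path, question):
    # same conversation template as the quickstart above
    text = ("A chat between a curious user and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the user's questions. "
            f"USER: <image>\n{question} ASSISTANT:")
    chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
    input_ids = torch.tensor(chunks[0] + [-200] + chunks[1][1:], dtype=torch.long).unsqueeze(0).to(model.device)

    image = Image.open(image_path)
    image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=model.device)

    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=100, use_cache=True)[0]
    return tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()

# a positive question vs. a leading negative counterpart about the same image
print(ask('path/to/image', "What color is the umbrella in the image?"))
print(ask('path/to/image', "The umbrella in the image is green, isn't it?"))
```

A robust model should answer the second question from the image content rather than accepting the false premise embedded in it.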

## Citation

If you find this repository helpful, please cite the paper below.

```bibtex
@article{he2024bunny,
  title={Efficient Multimodal Learning from Data-centric Perspective},
  author={He, Muyang and Liu, Yexin and Wu, Boya and Yuan, Jianhao and Wang, Yueze and Huang, Tiejun and Zhao, Bo},
  journal={arXiv preprint arXiv:2402.11530},
  year={2024}
}
```

## License

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.

The content of this project itself is licensed under the [cc-by-4.0](./LICENSE).