Key Contributions
LOVA3 - To the best of our knowledge, LOVA3 is the first effort to imbue an MLLM with question-asking and assessment abilities during training, inspired by human learning mechanisms.
EvalQABench - We build EvalQABench, a new benchmark for VQA correction evaluation, as a first effort to advance future research in this direction.
Performance Improvement - Training with the proposed LOVA3 framework yields consistent improvements on 10 representative multimodal benchmarks.
Model Weight
Pretrained weight: LOVA3-llava-v1.5-7b
Download it with the following command:
git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b
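If git-lfs is not set up, the checkpoint can also be fetched with the Hugging Face CLI. The command below is an alternative sketch, assuming the huggingface_hub CLI is installed; the target directory (checkpoints/LOVA3-llava-v1.5-7b, matching the evaluation steps below) is illustrative.

```bash
# Alternative download via the Hugging Face CLI (requires: pip install -U huggingface_hub).
# The --local-dir path is illustrative; place the weights wherever your scripts expect them.
huggingface-cli download hhenryz/LOVA3-llava-v1.5-7b --local-dir checkpoints/LOVA3-llava-v1.5-7b
```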
Training Data
We provide the training/evaluation/testing splits of EvalQABench under the folder EvalQABench.
Training data: Mixed_VQA_GenQA_EvalQA_1.5M.jsonl
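Each line of the JSONL file is a standalone JSON record, so a quick way to inspect the format is to pretty-print the first entry. The data/ prefix below assumes the file is placed as described in the Training section.

```bash
# Pretty-print the first training record to inspect its fields
# (path assumes the layout used in the Training section below).
head -n 1 data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl | python -m json.tool
```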
Image Datasets
Please download the images from the constituent datasets:
- COCO: train2014
- GQA: images
- OCR-VQA: download script (we save all files as .jpg)
- AOKVQA: download script
- TextVQA: train_val_images
- VisualGenome: part1, part2
- LLaVA-Instruct: huggingface
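After downloading, the images are typically organized under a single data root. The commands below are only an illustrative sketch of such a layout, assuming the LLaVA-1.5-style folder names; the image paths referenced by the training data should point to wherever the images actually live.

```bash
# Illustrative image layout (folder names follow the LLaVA-1.5 convention and are assumptions):
mkdir -p playground/data/coco/train2014 \
         playground/data/gqa/images \
         playground/data/ocr_vqa/images \
         playground/data/textvqa/train_images \
         playground/data/vg/VG_100K \
         playground/data/vg/VG_100K_2
```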
Evaluation
1. Download LOVA3-llava-v1.5-7b and place it under the folder checkpoints.
2. Download the CLIP vision encoder clip-vit-large-patch14-336 and place it under the folder checkpoints.
3. Run the evaluation scripts under the folder scripts/v1_5/eval. There are 12 multimodal datasets and benchmarks to evaluate.
Taking VizWiz as an example, the commands are as follows:
modelname=LOVA3-llava-v1.5-7b
python -m llava.eval.model_vqa_loader \
--model-path checkpoints/$modelname \
--question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
--image-folder /yourpath/vizwiz/test/ \
--answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
--temperature 0 \
--conv-mode vicuna_v1
python scripts/convert_vizwiz_for_submission.py \
--annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
--result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
--result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json
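To cover the remaining benchmarks, the per-benchmark scripts under scripts/v1_5/eval can be run in sequence. The loop below is a minimal sketch that assumes each script is self-contained; check the individual scripts for any arguments or dataset paths they expect.

```bash
# Run every evaluation script in sequence; adjust if a script expects
# arguments or a specific working directory.
for script in scripts/v1_5/eval/*.sh; do
    echo "Running $script"
    bash "$script"
done
```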
Training
1. Download the pretrained MLP adapter weights llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 and put them under the folder checkpoints.
2. Download the vision encoder weights clip-vit-large-patch14-336 under the folder checkpoints.
3. Download the language model weights vicuna-7b-v1.5 under the folder checkpoints.
4. Download the training data Mixed_VQA_GenQA_EvalQA_1.5M.jsonl under the folder data.
5. Run the training script:
bash scripts/v1_5/finetune.sh
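Before launching, it can help to verify that everything from steps 1-4 is actually in place; the check below simply mirrors the paths listed above.

```bash
# Sanity-check that the assets from steps 1-4 exist before starting the run.
for path in \
    checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 \
    checkpoints/clip-vit-large-patch14-336 \
    checkpoints/vicuna-7b-v1.5 \
    data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl
do
    if [ -e "$path" ]; then echo "OK      $path"; else echo "MISSING $path"; fi
done
```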
Acknowledgement
Citation
If you find LOVA3 useful, please cite using this BibTeX:
@inproceedings{
zhao2024lova,
title={{LOVA}3: Learning to Visual Question Answering, Asking and Assessment},
author={Hengyuan Zhao and Pan Zhou and Difei Gao and Zechen Bai and Mike Zheng Shou},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=vIOKLMl6wu}
}