Key Contributions
LOVA3 - To the best of our knowledge, LOVA3 is the first effort to imbue an MLLM with question-asking and assessment abilities during training, inspired by human learning mechanisms.
EvalQABench - We build EvalQABench, a new benchmark for VQA correction evaluation, as a first effort to advance future research in this direction.
Performance Improvement - Training with the proposed LOVA3 framework yields consistent improvements on 10 representative multimodal benchmarks.
Model Weight
Pretrained weight: LOVA3-llava-v1.5-7b
Download it with the following command:
git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b
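If git-lfs is not set up, the checkpoint can also be fetched with the Hugging Face CLI. The command below is an alternative sketch, assuming the huggingface_hub CLI is installed; the target directory (checkpoints/LOVA3-llava-v1.5-7b, matching the evaluation steps below) is illustrative.

```bash
# Alternative download via the Hugging Face CLI (requires: pip install -U huggingface_hub).
# The --local-dir path is illustrative; place the weights wherever your scripts expect them.
huggingface-cli download hhenryz/LOVA3-llava-v1.5-7b --local-dir checkpoints/LOVA3-llava-v1.5-7b
```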
Training Data
We provide the training/evaluation/testing splits of EvalQABench under the folder EvalQABench.
Training data: Mixed_VQA_GenQA_EvalQA_1.5M.jsonl
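Each line of the JSONL file is a standalone JSON record, so a quick way to inspect the format is to pretty-print the first entry. The data/ prefix below assumes the file is placed as described in the Training section.

```bash
# Pretty-print the first training record to inspect its fields
# (path assumes the layout used in the Training section below).
head -n 1 data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl | python -m json.tool
```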
Image Datasets
Please download the images from the constituent datasets:
- COCO: train2014
- GQA: images
- OCR-VQA: download script (we save all files as .jpg)
- AOKVQA: download script
- TextVQA: train_val_images
- VisualGenome: part1, part2
- LLaVA-Instruct: huggingface
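After downloading, the images are typically organized under a single data root. The commands below are only an illustrative sketch of such a layout, assuming the LLaVA-1.5-style folder names; the image paths referenced by the training data should point to wherever the images actually live.

```bash
# Illustrative image layout (folder names follow the LLaVA-1.5 convention and are assumptions):
mkdir -p playground/data/coco/train2014 \
         playground/data/gqa/images \
         playground/data/ocr_vqa/images \
         playground/data/textvqa/train_images \
         playground/data/vg/VG_100K \
         playground/data/vg/VG_100K_2
```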
Evaluation
1. Download LOVA3-llava-v1.5-7b and place it under the folder checkpoints.
2. Download the CLIP vision encoder clip-vit-large-patch14-336 and place it under the folder checkpoints.
3. Run the evaluation scripts under the folder scripts/v1_5/eval. There are 12 multimodal datasets and benchmarks to evaluate.
Taking VizWiz as an example, the commands are as follows:
modelname=LOVA3-llava-v1.5-7b
python -m llava.eval.model_vqa_loader \
--model-path checkpoints/$modelname \
--question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
--image-folder /yourpath/vizwiz/test/ \
--answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
--temperature 0 \
--conv-mode vicuna_v1
python scripts/convert_vizwiz_for_submission.py \
--annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
--result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
--result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json
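To cover the remaining benchmarks, the per-benchmark scripts under scripts/v1_5/eval can be run in sequence. The loop below is a minimal sketch that assumes each script is self-contained; check the individual scripts for any arguments or dataset paths they expect.

```bash
# Run every evaluation script in sequence; adjust if a script expects
# arguments or a specific working directory.
for script in scripts/v1_5/eval/*.sh; do
    echo "Running $script"
    bash "$script"
done
```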
Training
1. Download the pretrained MLP adapter weights llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 and put them under the folder checkpoints.
2. Download the vision encoder weights clip-vit-large-patch14-336 under the folder checkpoints.
3. Download the language model weights vicuna-7b-v1.5 under the folder checkpoints.
4. Download the training data Mixed_VQA_GenQA_EvalQA_1.5M.jsonl under the folder data.
5. Run the training script:
bash scripts/v1_5/finetune.sh
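Before launching, it can help to verify that everything from steps 1-4 is actually in place; the check below simply mirrors the paths listed above.

```bash
# Sanity-check that the assets from steps 1-4 exist before starting the run.
for path in \
    checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 \
    checkpoints/clip-vit-large-patch14-336 \
    checkpoints/vicuna-7b-v1.5 \
    data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl
do
    if [ -e "$path" ]; then echo "OK      $path"; else echo "MISSING $path"; fi
done
```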
Acknowledgement
Citation
If you find LOVA3 useful, please cite using this BibTeX:
@inproceedings{
zhao2024lova,
title={{LOVA}3: Learning to Visual Question Answering, Asking and Assessment},
author={Hengyuan Zhao and Pan Zhou and Difei Gao and Zechen Bai and Mike Zheng Shou},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=vIOKLMl6wu}
}