💡 Key Contributions:

  • LOVA3 - To the best of our knowledge, LOVA3 is the first effort to imbue asking and assessment abilities into the training of a robust and intelligent MLLM, inspired by human learning mechanisms.

  • EvalQABench - We build EvalQABench, a new benchmark for VQA correction evaluation, as a first effort to advance future research in this direction.

  • Performance Improvement - Training with our proposed LOVA3 framework, we observe consistent improvements across 10 representative benchmarks.

Model Weights

Pretrained weight: LOVA3-llava-v1.5-7b

Download it with the following command:

git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b
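Alternatively, if the huggingface_hub CLI is installed (an assumption; it is not required by the steps above), the same weights can be fetched directly into the checkpoints folder used later:

huggingface-cli download hhenryz/LOVA3-llava-v1.5-7b --local-dir checkpoints/LOVA3-llava-v1.5-7b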

Training Data

Image Datasets

Please download the images from the constituent datasets:

💃 Evaluation

  1. Download LOVA3-llava-v1.5-7b and place it under the folder checkpoints (see the download sketch after this list).

  2. Download the CLIP vision encoder clip-vit-large-patch14-336 and place it under the folder checkpoints.

  3. Run the evaluation scripts under the folder scripts/v1_5/eval. They cover 12 multimodal datasets and benchmarks.
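A minimal sketch of steps 1 and 2, assuming both repositories are pulled from Hugging Face with git-lfs installed and that the CLIP encoder is the openai/clip-vit-large-patch14-336 release used by LLaVA-1.5:

mkdir -p checkpoints
# Step 1: LOVA3 model weights
git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b checkpoints/LOVA3-llava-v1.5-7b
# Step 2: CLIP vision encoder (assumed to be the OpenAI release)
git clone https://huggingface.co/openai/clip-vit-large-patch14-336 checkpoints/clip-vit-large-patch14-336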

Taking VizWiz as an example, the commands are as follows:

modelname=LOVA3-llava-v1.5-7b

python -m llava.eval.model_vqa_loader \
    --model-path checkpoints/$modelname \
    --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --image-folder /yourpath/vizwiz/test/ \
    --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

python scripts/convert_vizwiz_for_submission.py \
    --annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json
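The other benchmarks follow the same pattern through the scripts under scripts/v1_5/eval. As a sketch, assuming the script names mirror the LLaVA-1.5 evaluation scripts (check the folder for the actual names):

# Script names below are assumptions; adjust to what is present in scripts/v1_5/eval
for task in vizwiz gqa textvqa pope mme; do
    bash scripts/v1_5/eval/${task}.sh
done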

Training

  1. Download the pretrained MLP adapter weights llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 and put them under the folder checkpoints (a download sketch follows the training command below).

  2. Download the model weights clip-vit-large-patch14-336 and place them under the folder checkpoints.

  3. Download the model weights vicuna-7b-v1.5 and place them under the folder checkpoints.

  4. Download the training data Mixed_VQA_GenQA_EvalQA_1.5M.jsonl under the folder data.

  5. Run the training script.

bash scripts/v1_5/finetune.sh
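For reference, a minimal sketch of fetching the prerequisites from steps 1-4. The repository names liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5, openai/clip-vit-large-patch14-336, and lmsys/vicuna-7b-v1.5 are assumptions based on the usual Hugging Face releases; verify the actual sources before downloading.

mkdir -p checkpoints data
# Step 1: pretrained MLP adapter (repository name assumed from the LLaVA-1.5 release)
git clone https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 checkpoints/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
# Step 2: CLIP vision encoder (assumed to be the OpenAI release)
git clone https://huggingface.co/openai/clip-vit-large-patch14-336 checkpoints/clip-vit-large-patch14-336
# Step 3: Vicuna-7B language model (assumed to be the lmsys release)
git clone https://huggingface.co/lmsys/vicuna-7b-v1.5 checkpoints/vicuna-7b-v1.5
# Step 4: place the training data under data/ (hosting location not specified here)
# mv /yourpath/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl data/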

πŸ™ Acknowledgement

  • LLaVA: The codebase we built upon.
  • LAVIS: We used its scripts to download some of the datasets.

🎓 Citation

If you find LOVA3 useful, please cite using this BibTeX:

@inproceedings{
    zhao2024lova,
    title={{LOVA}3: Learning to Visual Question Answering, Asking and Assessment},
    author={Hengyuan Zhao and Pan Zhou and Difei Gao and Zechen Bai and Mike Zheng Shou},
    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year={2024},
    url={https://openreview.net/forum?id=vIOKLMl6wu}
}
