Discrepancy between leaderboard accuracy and my own results for LLaVA-OneVision-7b-ov on the BLINK benchmark

#11
by nicokossmann - opened

First of all, your leaderboard is mega helpful.

I am currently experimenting with LLaVA-OneVision on the BLINK benchmark. I took a closer look at the 0.5b-ov and 7b-ov Hugging Face checkpoints (latest transformers version) on BLINK.
Specifically, I evaluated the subtasks Visual_Correspondence, Visual_Similarity, Jigsaw, and Multi-view_Reasoning on their test data by uploading my predictions to the BLINK-Benchmark Evaluation Challenge page.
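For context, here is roughly how I generated the predictions. This is only a minimal sketch: it assumes the llava-hf/llava-onevision-qwen2-7b-ov-hf checkpoint, fp16, and a single BLINK question with its images; the actual prompt construction and option parsing are simplified.

```python
# Minimal sketch (not my exact script): run the Hugging Face checkpoint in fp16
# on one BLINK multiple-choice question.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

images = [Image.open("image_1.jpg"), Image.open("image_2.jpg")]  # images of one question
question = "Which point in the second image corresponds to the marked point? (A) ... (B) ..."

conversation = [{
    "role": "user",
    "content": [{"type": "image"}] * len(images) + [{"type": "text", "text": question}],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
# Decode only the newly generated tokens; `output` also contains the prompt.
answer = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```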
I then compared the results with those on your leaderboard. Here are my results for 7b-ov with fp16 precision:

{
    "test": {
        "Visual_Similarity": 0.4632352941176471,
        "Counting": 0,
        "Relative_Depth": 0,
        "Jigsaw": 0.5533333333333333,
        "Art_Style": 0,
        "Functional_Correspondence": 0,
        "Semantic_Correspondence": 0,
        "Spatial_Relation": 0,
        "Object_Localization": 0,
        "Visual_Correspondence": 0.27325581395348836,
        "Multi-view_Reasoning": 0.5714285714285714,
        "Relative_Reflectance": 0,
        "Forensic_Detection": 0,
        "IQ_Test": 0,
        "Total": 0.1329466437737886
    }
}
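For what it's worth, the low Total is expected: it is simply the mean over all 14 subtasks, and the ten subtasks I did not submit are scored 0. A quick check with the values from the JSON above:

```python
# "Total" is the mean over all 14 BLINK subtasks; unsubmitted subtasks count as 0.
submitted = [0.4632352941176471, 0.5533333333333333, 0.27325581395348836, 0.5714285714285714]
print(sum(submitted) / 14)  # 0.1329466..., matching "Total" above
```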

The leaderboard shows:

  • "Visual_Similarity": 80,
  • "Jigsaw": 62.7,
  • "Visual_Correspondence": 47.7,
  • "Multi-view_Reasoning": 54.1

And for the 0.5b-ov version, with fp16 precision, I got:

{
    "test": {
        "Visual_Similarity": 0.4632352941176471,
        "Counting": 0,
        "Relative_Depth": 0,
        "Jigsaw": 0.56,
        "Art_Style": 0,
        "Functional_Correspondence": 0,
        "Semantic_Correspondence": 0,
        "Spatial_Relation": 0,
        "Object_Localization": 0,
        "Visual_Correspondence": 0.3023255813953488,
        "Multi-view_Reasoning": 0.47368421052631576,
        "Relative_Reflectance": 0,
        "Forensic_Detection": 0,
        "IQ_Test": 0,
        "Total": 0.12851750614566512
    }
}

These roughly match the accuracies on the leaderboard:

  • "Visual_Similarity": 47.4,
  • "Jigsaw": 52.7,
  • "Visual_Correspondence": 28.5,
  • "Multi-view_Reasoning": 45.1

I also noticed that the overall accuracies on the leaderboard do not match the results from the paper.
Paper (Table 4):

  • LLaVA-OV-0.5B: 52.1
  • LLaVA-OV-7B: 48.2

Leaderboard (Overall):

  • LLaVA-OV-0.5B: 40.1
  • LLaVA-OV-7B: 53

I can't rule out that this is a mistake on my part, but the discrepancy is very striking.

OpenCompass org

Hi, @nicokossmann ,
Currently, VLMEvalKit supports evaluation on the BLINK val split, and we have released all VLM predictions in this Hugging Face dataset:
https://huggingface.co/datasets/VLMEval/OpenVLMRecords/tree/main.

You can build submission files based on our released predictions and try submitting to the official evaluation site again.
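For example, something along these lines (a rough sketch only: the file name, column names, and submission format below are assumptions; please check the released records and the challenge page for the exact formats):

```python
# Rough sketch: turn a released VLMEvalKit prediction file into a BLINK submission.
# Assumes an .xlsx with "index" and "prediction" columns and a JSON submission that
# maps each question id to the predicted option letter -- verify both before use.
import json
import pandas as pd

df = pd.read_excel("llava_onevision_qwen2_7b_ov_BLINK.xlsx")  # hypothetical file name

submission = {}
for _, row in df.iterrows():
    pred = str(row["prediction"]).strip()
    # Keep only the leading option letter, e.g. "(A) the first image" -> "(A)".
    letter = pred[:3] if pred.startswith("(") else f"({pred[:1].upper()})"
    submission[str(row["index"])] = letter

with open("blink_submission.json", "w") as f:
    json.dump(submission, f, indent=2)
```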

Also, the results reported in the llava-onevision paper look odd to me: the 0.5B model outperforming the 7B model on BLINK? I don't think that makes much sense.

Thank you for your response, @KennyUTC.
I compared my results for the 7b model with yours and plotted confusion matrices based on the multiple-choice answers on the val set.
For each of the subtasks described above, the first matrix is based on my predictions and the second on yours; the per-subtask accuracies computed from your predictions match the values on your leaderboard. A sketch of how the matrices were drawn follows the images.
For Jigsaw:
val_Jigsaw_confusion_matrix.png

val_Jigsaw_confusion_matrix_test.png

For Multi-view Reasoning:
val_Multi-view_Reasoning_confusion_matrix.png

val_Multi-view_Reasoning_confusion_matrix_test.png

For Visual Correspondence:
val_Visual_Correspondence_confusion_matrix.png

val_Visual_Correspondence_confusion_matrix_test.png

For Visual Similarity:
val_Visual_Similarity_confusion_matrix.png

val_Visual_Similarity_confusion_matrix_test.png
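For reference, this is roughly how the matrices were drawn (a minimal sketch; it assumes two aligned lists of option letters for one subtask, shown here with placeholder values):

```python
# Minimal sketch: confusion matrix of predicted vs. ground-truth options for one
# subtask on the val split. `gt` and `pred` below are placeholders.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

labels = ["(A)", "(B)", "(C)", "(D)"]
gt = ["(A)", "(B)", "(B)", "(C)"]    # ground-truth answers (placeholder)
pred = ["(A)", "(B)", "(C)", "(C)"]  # model predictions (placeholder)

ConfusionMatrixDisplay.from_predictions(gt, pred, labels=labels)
plt.title("val Jigsaw confusion matrix")
plt.savefig("val_Jigsaw_confusion_matrix.png")
```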

Could you tell me whether you used the model from the GitHub repo or the Hugging Face checkpoint for your evaluation? And could you also evaluate the Hugging Face checkpoint (the one I used for my predictions) on your side, to check whether there is a systematic error on my end?

nicokossmann changed discussion status to closed
