Discrepancy between the leaderboard accuracy and my own results for LLaVA-OneVision-7b-ov on the BLINK benchmark
First of all, your leaderboard is mega helpful.
I am currently experimenting with LLaVA-OneVision on the BLINK benchmark. I took a closer look at the 0.5b-ov and 7b-ov Hugging Face checkpoints with the latest transformers version.
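For context, this is roughly how I load and query the checkpoints (a minimal sketch of my setup; the exact BLINK prompting and answer parsing are omitted):

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# fp16 setup for the 7B checkpoint; I swap the model id for the 0.5b run
model_id = "llava-hf/llava-onevision-qwen2-7b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def answer(images, question):
    # one user turn with all images of the BLINK sample plus the multiple-choice question
    conversation = [{
        "role": "user",
        "content": [{"type": "image"} for _ in images] + [{"type": "text", "text": question}],
    }]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    return processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```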
Specifically, I evaluated the subtasks Visual_Correspondence, Visual_Similarity, Jigsaw, and Multi-view_Reasoning on the test split by uploading my predictions to the official BLINK benchmark evaluation challenge page.
I then compared the results with those on your leaderboard. Here are my results for 7b-ov at fp16 precision:
{
  "test": {
    "Visual_Similarity": 0.4632352941176471,
    "Counting": 0,
    "Relative_Depth": 0,
    "Jigsaw": 0.5533333333333333,
    "Art_Style": 0,
    "Functional_Correspondence": 0,
    "Semantic_Correspondence": 0,
    "Spatial_Relation": 0,
    "Object_Localization": 0,
    "Visual_Correspondence": 0.27325581395348836,
    "Multi-view_Reasoning": 0.5714285714285714,
    "Relative_Reflectance": 0,
    "Forensic_Detection": 0,
    "IQ_Test": 0,
    "Total": 0.1329466437737886
  }
}
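If I read it correctly, the Total is simply the unweighted mean over all 14 subtasks, so the ten subtasks I did not submit count as 0 and pull it down:

```python
# the reported "Total" appears to be the unweighted mean over all 14 subtasks,
# with the subtasks I did not submit counted as 0
submitted = {
    "Visual_Similarity": 0.4632352941176471,
    "Jigsaw": 0.5533333333333333,
    "Visual_Correspondence": 0.27325581395348836,
    "Multi-view_Reasoning": 0.5714285714285714,
}
print(sum(submitted.values()) / 14)  # ~0.1329, matches the reported Total
```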
The leaderboard shows:
- "Visual_Similarity": 80,
- "Jigsaw": 62.7,
- "Visual_Correspondence": 47.7,
- "Multi-view_Reasoning": 54.1
And for the 0.5b-ov version, I got the following at fp16 precision:
{
  "test": {
    "Visual_Similarity": 0.4632352941176471,
    "Counting": 0,
    "Relative_Depth": 0,
    "Jigsaw": 0.56,
    "Art_Style": 0,
    "Functional_Correspondence": 0,
    "Semantic_Correspondence": 0,
    "Spatial_Relation": 0,
    "Object_Localization": 0,
    "Visual_Correspondence": 0.3023255813953488,
    "Multi-view_Reasoning": 0.47368421052631576,
    "Relative_Reflectance": 0,
    "Forensic_Detection": 0,
    "IQ_Test": 0,
    "Total": 0.12851750614566512
  }
}
These roughly correspond to the accuracies on the leaderboard:
- "Visual_Similarity": 47.4,
- "Jigsaw": 52.7,
- "Visual_Correspondence": 28.5,
- "Multi-view_Reasoning": 45.1
I also noticed that the overall accuracies on the leaderboard do not match the results from the paper.
Paper (Table 4):
- LLaVA-OV-0.5B: 52.1
- LLaVA-OV-7B: 48.2
Leaderboard (Overall):
- LLaVA-OV-0.5B: 40.1
- LLaVA-OV-7B: 53
I don't want to rule out the possibility that this is a mistake on my part, but the discrepancy is quite striking.
Hi @nicokossmann,
Currently, VLMEvalKit supports evaluation on the BLINK val split, and we have released all VLM predictions in this Hugging Face dataset: https://huggingface.co/datasets/VLMEval/OpenVLMRecords/tree/main.
You can build submission files based on our released predictions and submit them to the official evaluation site again.
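A rough sketch of what that could look like (the file name and column names below are assumptions; check the released prediction files and the submission format the challenge page expects):

```python
import json
import pandas as pd

# Hypothetical prediction file from the OpenVLMRecords dataset; the actual
# file name and column names ("index", "prediction") may differ.
df = pd.read_excel("llava_onevision_qwen2_7b_ov_BLINK.xlsx")

# Map each question index to the predicted option; adapt the output structure
# to whatever format the official evaluation site requires.
submission = {str(row["index"]): str(row["prediction"]).strip() for _, row in df.iterrows()}

with open("blink_submission.json", "w") as f:
    json.dump(submission, f)
```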
Also, I find the results released in the LLaVA-OneVision paper odd: the 0.5B model outperforming the 7B model on BLINK? I don't think that makes much sense.
Thank you for your response, @KennyUTC.
I compared my results for the 7B model with yours and drew confusion matrices based on the multiple-choice answers on the val split.
One set is based on my predictions (first) and one on yours (second) for the subtasks described above; the per-subtask accuracies computed from your predictions match the values on your leaderboard.
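The matrices were drawn roughly like this (a sketch with placeholder letters; in practice I first extract the predicted option letter from each model answer on the val split):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = ["A", "B", "C", "D"]
# placeholder data: ground-truth letters plus the letters extracted from my
# predictions and from the released VLMEvalKit predictions for one subtask
gt    = ["A", "B", "C", "D", "A"]
mine  = ["A", "B", "D", "D", "B"]
yours = ["A", "C", "C", "D", "A"]

for title, preds in [("my predictions", mine), ("released predictions", yours)]:
    cm = confusion_matrix(gt, preds, labels=labels)
    ConfusionMatrixDisplay(cm, display_labels=labels).plot()
    plt.title(title)
plt.show()
```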
For Jigsaw:
Could you tell me whether you used the GitHub repo model or the Hugging Face checkpoint for the evaluation? And could you evaluate the Hugging Face checkpoint (which I used for my predictions) on your side, to check whether there is a systematic error on my end?