---
license: apache-2.0
language:
  - en
metrics:
  - accuracy
pipeline_tag: image-text-to-text
---

## Introduction

We used the powerful TinyLLaVA Factory framework to build a very small image-text-to-text model with only 296M parameters.

The goal is to make it possible to run LLaVA models on edge devices (with a few gigabytes of memory).
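As a rough sanity check on the edge-device claim, we can estimate the memory needed just to hold the weights of a 296M-parameter model at different precisions. This is back-of-envelope arithmetic only (the `weight_gib` helper is illustrative, not part of the model's code); real inference also needs memory for activations, the KV cache, and framework overhead.

```python
# Back-of-envelope memory footprint for a 296M-parameter model.
# Weights only; activations, KV cache, and runtime overhead are extra.
PARAMS = 296_000_000

def weight_gib(num_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in GiB at the given precision."""
    return num_params * bytes_per_param / 1024**3

fp32 = weight_gib(PARAMS, 4)  # ~1.10 GiB
fp16 = weight_gib(PARAMS, 2)  # ~0.55 GiB
int8 = weight_gib(PARAMS, 1)  # ~0.28 GiB
print(f"fp32: {fp32:.2f} GiB, fp16: {fp16:.2f} GiB, int8: {int8:.2f} GiB")
```

Even in full fp32 the weights fit comfortably within a few gigabytes, which is what makes this model size attractive for edge deployment.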

For the LLM and the vision tower, we chose OpenELM-270M-Instruct and facebook/dinov2-small, respectively.

## Results

So far we have evaluated only on the POPE benchmark, with the following results:

| Category    | # samples | TP   | FP  | TN   | FN  | Accuracy | Precision | Recall | F1 score | Yes ratio |
|-------------|-----------|------|-----|------|-----|----------|-----------|--------|----------|-----------|
| adversarial | 3000      | 1264 | 575 | 925  | 236 | 0.730    | 0.687     | 0.843  | 0.757    | 0.613     |
| popular     | 3000      | 1264 | 301 | 1199 | 236 | 0.821    | 0.808     | 0.843  | 0.825    | 0.522     |
| random      | 2910      | 1264 | 290 | 1120 | 236 | 0.819    | 0.813     | 0.843  | 0.828    | 0.534     |
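Since POPE is a binary yes/no task, all the reported metrics follow directly from the TP/FP/TN/FN counts. The sketch below (the `pope_metrics` helper is ours, not from the evaluation script) recomputes them for the adversarial split as a consistency check:

```python
# Recompute the POPE metrics from raw confusion counts.
def pope_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "yes_ratio": (tp + fp) / total,  # fraction of "yes" answers
    }

# Adversarial split counts from the table above.
m = pope_metrics(tp=1264, fp=575, tn=925, fn=236)
print({k: round(v, 3) for k, v in m.items()})
# → {'accuracy': 0.73, 'precision': 0.687, 'recall': 0.843, 'f1': 0.757, 'yes_ratio': 0.613}
```

The identical TP/FN counts across the three splits reflect how POPE works: the positive (object actually present) questions are shared, while the splits differ only in how the negative objects are sampled.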