---
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: image-text-to-text
---

# Introduction

We use the powerful [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) to create a very small image-text-to-text model with only 296M parameters. The goal is to make it possible to run LLaVA-style models on edge devices with only a few gigabytes of memory. For the LLM and the vision tower, we chose [apple/OpenELM-270M-Instruct](https://huggingface.co/apple/OpenELM-270M-Instruct) and [facebook/dinov2-small](https://huggingface.co/facebook/dinov2-small), respectively.

# Result

So far we have evaluated only on [POPE](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html#pope), with the following results:

| Category | # samples | TP | FP | TN | FN | Accuracy | Precision | Recall | F1 score | Yes ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| adversarial | 3000 | 1264 | 575 | 925 | 236 | 0.7297 | 0.6873 | 0.8427 | 0.7571 | 0.6130 |
| popular | 3000 | 1264 | 301 | 1199 | 236 | 0.8210 | 0.8077 | 0.8427 | 0.8248 | 0.5217 |
| random | 2910 | 1264 | 290 | 1120 | 236 | 0.8192 | 0.8134 | 0.8427 | 0.8278 | 0.5340 |
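All of the metrics above follow directly from the confusion counts. As a sanity check, here is a short sketch that recomputes the adversarial row from its TP/FP/TN/FN values (the counts are taken verbatim from the table; nothing else is assumed):

```python
# Recompute the POPE "adversarial" metrics from the raw confusion counts above.
TP, FP, TN, FN = 1264, 575, 925, 236
total = TP + FP + TN + FN  # 3000 samples

accuracy = (TP + TN) / total                         # 0.7297
precision = TP / (TP + FP)                           # 0.6873
recall = TP / (TP + FN)                              # 0.8427
f1 = 2 * precision * recall / (precision + recall)   # 0.7571
yes_ratio = (TP + FP) / total                        # 0.6130 (fraction of "yes" answers)

print(accuracy, precision, recall, f1, yes_ratio)
```

Note that TP and FN are identical across all three categories; POPE reuses the same "yes" questions and varies only how the negative ("no") objects are sampled, so only FP and TN change.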
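# Usage

A minimal inference sketch following the loading pattern from the TinyLLaVA Factory README. The repo id below is a placeholder, and the `chat` helper is assumed to be provided by the checkpoint's remote code, as it is for other TinyLLaVA Factory checkpoints; adjust to match this model's actual repo.

```python
# Sketch only: "your-username/tinyllava-openelm-270m-dinov2-small" is a
# placeholder repo id, and model.chat(...) is assumed to come from the
# checkpoint's trust_remote_code implementation (TinyLLaVA Factory style).
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_path = "your-username/tinyllava-openelm-270m-dinov2-small"  # placeholder
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    hf_path,
    use_fast=False,
    model_max_length=model.config.tokenizer_model_max_length,
    padding_side=model.config.tokenizer_padding_side,
)

prompt = "What objects are in this image?"
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
output_text, generation_time = model.chat(
    prompt=prompt, image=image_url, tokenizer=tokenizer
)
print(output_text)
```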