---
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: image-text-to-text
---

# Introduction

We use the powerful [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) to create a very small image-text-to-text model with only 296M parameters. The goal is to make it possible to run LLaVA-style models on edge devices with only a few gigabytes of memory. For the LLM and the vision tower, we chose [apple/OpenELM-270M-Instruct](https://huggingface.co/apple/OpenELM-270M-Instruct) and [facebook/dinov2-small](https://huggingface.co/facebook/dinov2-small), respectively.

# Result

So far we have evaluated only on [POPE](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html#pope), with the following results:

| Category | # samples | TP | FP | TN | FN | Accuracy | Precision | Recall | F1 score | Yes ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| adversarial | 3000 | 1264 | 575 | 925 | 236 | 0.7297 | 0.6873 | 0.8427 | 0.7571 | 0.6130 |
| popular | 3000 | 1264 | 301 | 1199 | 236 | 0.8210 | 0.8077 | 0.8427 | 0.8248 | 0.5217 |
| random | 2910 | 1264 | 290 | 1120 | 236 | 0.8192 | 0.8134 | 0.8427 | 0.8278 | 0.5340 |
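All of the metrics above follow directly from the confusion counts. As a sanity check, here is a short sketch that recomputes the adversarial row from its TP/FP/TN/FN values (the counts are taken verbatim from the table; nothing else is assumed):

```python
# Recompute the POPE "adversarial" metrics from the raw confusion counts above.
TP, FP, TN, FN = 1264, 575, 925, 236
total = TP + FP + TN + FN  # 3000 samples

accuracy = (TP + TN) / total                         # 0.7297
precision = TP / (TP + FP)                           # 0.6873
recall = TP / (TP + FN)                              # 0.8427
f1 = 2 * precision * recall / (precision + recall)   # 0.7571
yes_ratio = (TP + FP) / total                        # 0.6130 (fraction of "yes" answers)

print(accuracy, precision, recall, f1, yes_ratio)
```

Note that TP and FN are identical across all three categories; POPE reuses the same "yes" questions and varies only how the negative ("no") objects are sampled, so only FP and TN change.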
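# Usage

A minimal inference sketch following the loading pattern from the TinyLLaVA Factory README. The repo id below is a placeholder, and the `chat` helper is assumed to be provided by the checkpoint's remote code, as it is for other TinyLLaVA Factory checkpoints; adjust to match this model's actual repo.

```python
# Sketch only: "your-username/tinyllava-openelm-270m-dinov2-small" is a
# placeholder repo id, and model.chat(...) is assumed to come from the
# checkpoint's trust_remote_code implementation (TinyLLaVA Factory style).
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_path = "your-username/tinyllava-openelm-270m-dinov2-small"  # placeholder
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    hf_path,
    use_fast=False,
    model_max_length=model.config.tokenizer_model_max_length,
    padding_side=model.config.tokenizer_padding_side,
)

prompt = "What objects are in this image?"
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
output_text, generation_time = model.chat(
    prompt=prompt, image=image_url, tokenizer=tokenizer
)
print(output_text)
```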