---
license: apache-2.0
language:
- en
metrics:
- accuracy
pipeline_tag: image-text-to-text
---

# Introduction

We use [TinyLLaVA Factory](https://github.com/TinyLLaVA/TinyLLaVA_Factory) to build a very small image-text-to-text model with only 296M parameters.

The goal is to make it possible to run LLaVA-style models on edge devices that have only a few gigabytes of memory.

For the LLM and the vision tower, we use [OpenELM-270M-Instruct](https://huggingface.co/apple/OpenELM-270M-Instruct) and [facebook/dinov2-small](https://huggingface.co/facebook/dinov2-small), respectively.

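To put the 296M parameter count in the context of the edge-device goal, the following is a rough back-of-the-envelope sketch of the memory needed for the weights alone. It assumes dense storage at a fixed number of bytes per parameter and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The helper name `weight_memory_mib` is hypothetical and not part of TinyLLaVA Factory.

```python
# Rough weight-memory estimate for a 296M-parameter model.
# Assumption: only the weights are counted; activations, the KV cache,
# and framework overhead are excluded.
PARAMS = 296_000_000  # parameter count stated in this card

# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Return the size of the weights in MiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_mib(PARAMS, dtype):.0f} MiB")
```

Under these assumptions the weights take roughly 1.1 GiB in float32 and about half of that in float16, which is consistent with targeting devices that have a few gigabytes of memory.
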
# Results

[POPE](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html#pope):

| Category    | # Samples | TP   | FP  | TN   | FN  | Accuracy | Precision | Recall | F1 Score | Yes Ratio |
|-------------|-----------|------|-----|------|-----|----------|-----------|--------|----------|-----------|
| Adversarial | 3000      | 1264 | 575 | 925  | 236 | 0.7297   | 0.6873    | 0.8427 | 0.7571   | 0.613     |
| Popular     | 3000      | 1264 | 301 | 1199 | 236 | 0.8210   | 0.8077    | 0.8427 | 0.8248   | 0.5217    |
| Random      | 2910      | 1264 | 290 | 1120 | 236 | 0.8192   | 0.8134    | 0.8427 | 0.8278   | 0.5340    |
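
The derived columns in the POPE table follow from the four confusion-matrix counts. The sketch below recomputes them, assuming that "yes" is the positive class and that Yes Ratio is the fraction of all answers predicted as "yes". The names `Confusion` and `pope_metrics` are hypothetical and are not taken from the TinyLLaVA Factory evaluation scripts.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts, with "yes" treated as the positive class."""
    tp: int
    fp: int
    tn: int
    fn: int


def pope_metrics(c: Confusion) -> dict:
    """Derive the table's metric columns from the raw counts."""
    total = c.tp + c.fp + c.tn + c.fn
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Assumed definition: share of samples answered "yes".
        "yes_ratio": (c.tp + c.fp) / total,
    }


# Counts copied from the POPE table in this card.
ROWS = {
    "Adversarial": Confusion(tp=1264, fp=575, tn=925, fn=236),
    "Popular": Confusion(tp=1264, fp=301, tn=1199, fn=236),
    "Random": Confusion(tp=1264, fp=290, tn=1120, fn=236),
}

if __name__ == "__main__":
    for name, counts in ROWS.items():
        rounded = {k: round(v, 4) for k, v in pope_metrics(counts).items()}
        print(name, rounded)
```

Because TP and FN are identical across the three splits, recall is the same in every row; only the negative samples differ, which is why precision and accuracy drop on the adversarial split.
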

[TextVQA](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html#textvqa):

Samples: 5000, Accuracy: 27%

[ScienceQA](https://tinyllava-factory.readthedocs.io/en/latest/Evaluation.html#scienceqa):

Samples: 4241, Correct: 1725, Accuracy: 40.64%, IMG-Accuracy: 36.54%