Mozilla/distilvit

@@ -1,94 +1,60 @@
----
-tags:
-  - image-to-text
-  - image-captioning
-license: apache-2.0
-metrics:
-  - rouge
-datasets:
-  - Mozilla/flickr30k-transformed-captions
-widget:
-  - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
-    example_title: Savanna
-  - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
-    example_title: Football Match
-  - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
-    example_title: Airport
-base_model:
-  - google/vit-base-patch16-224-in21k
-model-index:
-  - name: mozilla/distilvit
-    results:
-      - task:
-          type: image-to-text
-          name: Image To Text
-        dataset:
-          name: Mozilla/flickr30k-transformed-captions
-          type: Mozilla/flickr30k-transformed-captions
-        metrics:
-          - name: ROUGE-1
-            type: rouge
-            value: 43.006
-            verified: true
-          - name: ROUGE-2
-            type: rouge
-            value: 16.9939
-            verified: true
-          - name: ROUGE-L
-            type: rouge
-            value: 38.8923
-            verified: true
-          - name: ROUGE-LSUM
-            type: rouge
-            value: 38.8877
-            verified: true
-          - name: loss
-            type: loss
-            value: 0.19939416646957397
-          - name: gen_len
-            type: gen_len
-            value: 11.327256736227712
-            verified: true
----
-# distilvit
-This model is a work in progress. Fine-tuned version of those base models:
-- a VIT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
-- a Distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
-This model was trained on:
-- [Flickr30k debiased](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions-gpt4o)
-- [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
-- [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)
-- A debiased version of COCO 2017: https://cocodataset.org
-You can find the code used to create the model here: https://github.com/mozilla/distilvit
-# training results
-- eval/gen_len 14.99729
-- eval/loss 0.17093
-- eval/meteor 0.51479
-- eval/rouge1 57.8066
-- eval/rouge2 35.0888
-- eval/rougeL 52.9138
-- eval/rougeLsum 52.9101
-- eval/runtime 760.2135
-- eval/samples_per_second 11.18
-- eval/steps_per_second 0.112
-- train/epoch 8.0
-- train/global_step 11752
-- train/learning_rate 0.0
-- train/loss 0.1034
-- train/total_flos 1.518634875573869e+20
-- train/train_loss 0.14875
-- train/train_runtime 91405.9053
-- train/train_samples_per_second 12.855
-- train/train_steps_per_second 0.129

+---
+tags:
+  - image-to-text
+  - image-captioning
+license: apache-2.0
+metrics:
+  - rouge
+datasets:
+  - Mozilla/flickr30k-transformed-captions-gpt4o
+widget:
+  - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
+    example_title: Savanna
+  - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
+    example_title: Football Match
+  - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
+    example_title: Airport
+base_model:
+  - google/vit-base-patch16-224-in21k
+---
+# distilvit
+This model is a work in progress. Fine-tuned version of those base models:
+- a VIT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
+- a Distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
+This model was trained on:
+- [Flickr30k debiased](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions-gpt4o)
+- [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
+- [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)
+- A debiased version of COCO 2017: https://cocodataset.org
+You can find the code used to create the model here: https://github.com/mozilla/distilvit
+# training results
+- eval/gen_len 14.99729
+- eval/loss 0.17093
+- eval/meteor 0.51479
+- eval/rouge1 57.8066
+- eval/rouge2 35.0888
+- eval/rougeL 52.9138
+- eval/rougeLsum 52.9101
+- eval/runtime 760.2135
+- eval/samples_per_second 11.18
+- eval/steps_per_second 0.112
+- train/epoch 8.0
+- train/global_step 11752
+- train/learning_rate 0.0
+- train/loss 0.1034
+- train/total_flos 1.518634875573869e+20
+- train/train_loss 0.14875
+- train/train_runtime 91405.9053
+- train/train_samples_per_second 12.855
+- train/train_steps_per_second 0.129
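
The raw numbers in the "training results" list are easier to read once a few derived figures are computed from them. The following is a minimal sketch, not part of the repository: the helper `parse_metrics` is hypothetical, and it assumes that `train/train_runtime` is reported in seconds, which the card itself does not state.

```python
# Metrics copied verbatim from the "training results" list in the model card.
RAW_METRICS = """\
- eval/gen_len 14.99729
- eval/loss 0.17093
- eval/meteor 0.51479
- eval/rouge1 57.8066
- eval/rouge2 35.0888
- eval/rougeL 52.9138
- eval/rougeLsum 52.9101
- eval/runtime 760.2135
- eval/samples_per_second 11.18
- eval/steps_per_second 0.112
- train/epoch 8.0
- train/global_step 11752
- train/learning_rate 0.0
- train/loss 0.1034
- train/total_flos 1.518634875573869e+20
- train/train_loss 0.14875
- train/train_runtime 91405.9053
- train/train_samples_per_second 12.855
- train/train_steps_per_second 0.129
"""


def parse_metrics(text: str) -> dict[str, float]:
    """Hypothetical helper: turn '- name value' bullet lines into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # skip anything that is not a bullet
        name, value = line[2:].rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics


metrics = parse_metrics(RAW_METRICS)

# Optimizer steps per epoch, from the final step count and epoch count.
steps_per_epoch = metrics["train/global_step"] / metrics["train/epoch"]

# Wall-clock training time in hours (assumption: runtime is in seconds).
train_hours = metrics["train/train_runtime"] / 3600

# Samples consumed per optimizer step, i.e. the effective batch size,
# estimated from the two rounded throughput figures.
samples_per_step = (
    metrics["train/train_samples_per_second"]
    / metrics["train/train_steps_per_second"]
)

print(f"steps per epoch:  {steps_per_epoch:.0f}")
print(f"training time:    {train_hours:.1f} h")
print(f"samples per step: {samples_per_step:.1f}")
```

Because the throughput values in the card are rounded, the samples-per-step figure is only an estimate of the effective batch size.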