---
tags:
- image-to-text
- image-captioning
license: apache-2.0
metrics:
- rouge
datasets:
- Mozilla/flickr30k-transformed-captions-gpt4o
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
base_model:
- google/vit-base-patch16-224-in21k
---

# distilvit

This model is a work in progress. It is a fine-tuned combination of two base models:

- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- a distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2

This model was trained on:

- [Flickr30k debiased](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions-gpt4o)
- [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
- [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)
- A debiased version of COCO 2017: https://cocodataset.org

You can find the code used to create the model here: https://github.com/mozilla/distilvit

# training results

- eval/gen_len 14.99729
- eval/loss 0.17093
- eval/meteor 0.51479
- eval/rouge1 57.8066
- eval/rouge2 35.0888
- eval/rougeL 52.9138
- eval/rougeLsum 52.9101
- eval/runtime 760.2135
- eval/samples_per_second 11.18
- eval/steps_per_second 0.112
- train/epoch 8.0
- train/global_step 11752
- train/learning_rate 0.0
- train/loss 0.1034
- train/total_flos 1.518634875573869e+20
- train/train_loss 0.14875
- train/train_runtime 91405.9053
- train/train_samples_per_second 12.855
- train/train_steps_per_second 0.129
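
The throughput figures in the training results can be cross-checked with a little arithmetic. The sketch below is only a sanity check on the reported numbers, not part of the training code: the variable names are illustrative, and reading the samples-to-steps ratio as the effective number of samples per step is an assumption, since the card does not state a batch size. The reported rates are rounded, so the derived values are approximate.

```python
# Figures copied from the training results listed above.
train = {
    "global_step": 11752,
    "epoch": 8.0,
    "runtime_s": 91405.9053,
    "samples_per_s": 12.855,
    "steps_per_s": 0.129,
}
evaluation = {
    "runtime_s": 760.2135,
    "samples_per_s": 11.18,
    "steps_per_s": 0.112,
}

# Optimizer steps per epoch: total steps divided by the number of epochs.
steps_per_epoch = train["global_step"] / train["epoch"]

# Samples processed per step (assumed to reflect the effective batch size).
train_samples_per_step = train["samples_per_s"] / train["steps_per_s"]
eval_samples_per_step = evaluation["samples_per_s"] / evaluation["steps_per_s"]

# Approximate sample counts: rate multiplied by wall-clock runtime.
train_samples_per_epoch = train["samples_per_s"] * train["runtime_s"] / train["epoch"]
eval_samples = evaluation["samples_per_s"] * evaluation["runtime_s"]

print(f"steps per epoch:           {steps_per_epoch:.0f}")
print(f"train samples per step:    {train_samples_per_step:.1f}")
print(f"eval samples per step:     {eval_samples_per_step:.1f}")
print(f"train samples per epoch:   {train_samples_per_epoch:,.0f}")
print(f"eval samples (approx.):    {eval_samples:,.0f}")
```

Under these assumptions the run works out to 1469 steps per epoch and roughly 100 samples per step for both training and evaluation, which suggests the reported rates are internally consistent.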