metadata
tags:
- image-to-text
- image-captioning
license: apache-2.0
metrics:
- rouge
datasets:
- Mozilla/flickr30k-transformed-captions-gpt4o
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
base_model:
- google/vit-base-patch16-224-in21k
distilvit
This model is a work in progress. It is a fine-tuned combination of two base models:
- a ViT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- a distilled GPT-2 model (DistilGPT2) for the text decoder: https://huggingface.co/distilbert/distilgpt2
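To make the encoder/decoder pairing concrete, here is a minimal sketch of how the two base checkpoints could be joined with the `VisionEncoderDecoderModel` class from Transformers. This is an illustration under assumptions, not the project's actual training code: the function name `build_untrained_captioner` is hypothetical, and the token settings follow the common pattern for GPT-2 style decoders rather than anything stated in this card.

```python
ENCODER_ID = "google/vit-base-patch16-224-in21k"  # image encoder named in this card
DECODER_ID = "distilbert/distilgpt2"              # text decoder named in this card


def build_untrained_captioner(encoder_id=ENCODER_ID, decoder_id=DECODER_ID):
    """Pair a ViT encoder with a DistilGPT2 decoder (hypothetical helper).

    The cross-attention layers that connect the two are newly initialized,
    so the result needs fine-tuning on captions before it produces useful text.
    """
    # Imported lazily so the heavy dependency is only loaded when needed.
    from transformers import AutoTokenizer, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_id, decoder_id
    )
    tokenizer = AutoTokenizer.from_pretrained(decoder_id)

    # Assumption: GPT-2 tokenizers ship without a pad token, so EOS is reused.
    tokenizer.pad_token = tokenizer.eos_token
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, tokenizer
```

Calling `build_untrained_captioner()` downloads both checkpoints; the repository linked in this card remains the reference for how the published model was actually assembled and trained.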
This model was trained on:
- A debiased version of COCO 2017
- A debiased version of Flickr30k
- Images from Pexels
- DocOrNot
- Alt Text Validation
You can find the code used to create the model here: https://github.com/mozilla/distilvit
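For readers who only want captions, the following sketch shows one way the model might be called through the `image-to-text` pipeline task found in Transformers 4.x releases. The model identifier `Mozilla/distilvit` is an assumption (this card does not state it), and `caption_image` is a hypothetical helper name.

```python
MODEL_ID = "Mozilla/distilvit"  # assumption: Hub ID is not stated in this card


def caption_image(image, model_id=MODEL_ID, max_new_tokens=20):
    """Return one caption for `image` (a local path, a URL, or a PIL image)."""
    # Imported lazily so the heavy dependency is only loaded when needed.
    from transformers import pipeline

    captioner = pipeline("image-to-text", model=model_id)
    outputs = captioner(image, max_new_tokens=max_new_tokens)
    # For a single image the pipeline returns a list of dicts.
    return outputs[0]["generated_text"]
```

For example, `caption_image("https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg")` would caption the first widget image listed in the metadata. Building the pipeline once and reusing it is preferable when captioning many images.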
training results
{
  "train/loss": 0.0781,
  "train/learning_rate": 0.00003793103448275862,
  "train/epoch": 2.41,
  "train/global_step": 700,
  "eval/loss": 0.09741172194480896,
  "eval/rouge1": 60.382,
  "eval/rouge2": 38.0754,
  "eval/rougeL": 56.9132,
  "eval/rougeLsum": 56.9214,
  "eval/meteor": 0.5448683804505693,
  "eval/gen_len": 9.864678265672467,
  "eval/runtime": 343.0443,
  "eval/samples_per_second": 10.555,
  "eval/steps_per_second": 0.108,
  "train/train_runtime": 10567.9413,
  "train/train_samples_per_second": 27.414,
  "train/train_steps_per_second": 0.274,
  "train/total_flos": 9039628706135409000,
  "train/train_loss": 0.09852950266429356
}
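The throughput figures above can be turned into rough run-level estimates with a little arithmetic. The sketch below is only an approximation: it assumes the `*_per_second` values are averages over the corresponding runtime, as is conventional for Trainer-style logs, and the `summarize` helper and its output keys are hypothetical names.

```python
# Values copied from the training results above.
results = {
    "train/epoch": 2.41,
    "train/global_step": 700,
    "train/train_runtime": 10567.9413,
    "train/train_samples_per_second": 27.414,
    "train/train_steps_per_second": 0.274,
    "eval/runtime": 343.0443,
    "eval/samples_per_second": 10.555,
}


def summarize(r):
    """Derive approximate run-level quantities from logged throughput."""
    return {
        # Wall-clock training time in hours.
        "train_hours": r["train/train_runtime"] / 3600,
        # Optimizer steps per epoch, from the logged step/epoch pair.
        "steps_per_epoch": r["train/global_step"] / r["train/epoch"],
        # Total optimizer steps implied by runtime and step rate.
        "total_steps": r["train/train_runtime"] * r["train/train_steps_per_second"],
        # Samples consumed per optimizer step (effective batch size).
        "effective_batch": r["train/train_samples_per_second"]
        / r["train/train_steps_per_second"],
        # Approximate number of evaluation samples.
        "eval_samples": r["eval/runtime"] * r["eval/samples_per_second"],
    }


summary = summarize(results)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

Because the logged rates are rounded, the derived numbers are estimates rather than exact counts. Note also that the ROUGE scores are reported on a 0 to 100 scale while METEOR is on a 0 to 1 scale.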