tarekziade committed on
Commit
2932b88
1 Parent(s): e1927d3

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +81 -78
README.md CHANGED
@@ -1,78 +1,81 @@
- ---
- tags:
- - image-to-text
- - image-captioning
- license: apache-2.0
- metrics:
- - rouge
- datasets:
- - nlphuji/flickr30k
- widget:
- - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
-   example_title: Savanna
- - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
-   example_title: Football Match
- - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
-   example_title: Airport
- base_model:
- - google/vit-base-patch16-224-in21k
-
- model-index:
- - name: mozilla/distilvit
-   results:
-   - task:
-       type: image-to-text
-       name: Image To Text
-     dataset:
-       name: Mozilla/flickr30k-transformed-captions
-       type: Mozilla/flickr30k-transformed-captions
-     metrics:
-     - name: ROUGE-1
-       type: rouge
-       value: 43.006
-       verified: true
-     - name: ROUGE-2
-       type: rouge
-       value: 16.9939
-       verified: true
-     - name: ROUGE-L
-       type: rouge
-       value: 38.8923
-       verified: true
-     - name: ROUGE-LSUM
-       type: rouge
-       value: 38.8877
-       verified: true
-     - name: loss
-       type: loss
-       value: 0.19939416646957397
-     - name: gen_len
-       type: gen_len
-       value: 11.327256736227712
-       verified: true
- ---
-
- # distilvit
-
- This model is a work in progress. Fine-tuned version of those base models:
-
- - a VIT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
- - a Distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
-
- This model was trained on:
-
- - Flickr30k : https://huggingface.co/datasets/nlphuji/flickr30k
- - COCO 2017: https://cocodataset.org
-
- You can get that checkpoint using the 3083a3cef6e3c8dd90df3f088074bbe836b0f403 commit.
-
- It was then further fine-tuned on :
-
- - [Flickr30k debiased](https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions)
- - [DocOrNot](https://huggingface.co/datasets/Mozilla/docornot)
- - [Alt Text Validation](https://huggingface.co/datasets/Mozilla/alt-text-validation)
-
- For the latter, the dataset was annotated by our team to correct the alt text generated by the model,
- using the [checkvite tool](https://github.com/mozila/checkvite).
-
- You can find the code used to create the model here: https://github.com/mozilla/distilvit
+ ---
+ tags:
+ - image-to-text
+ - image-captioning
+ license: apache-2.0
+ metrics:
+ - rouge
+ datasets:
+ - nlphuji/flickr30k
+ widget:
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
+   example_title: Savanna
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
+   example_title: Football Match
+ - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
+   example_title: Airport
+ base_model:
+ - google/vit-base-patch16-224-in21k
+
+ model-index:
+ - name: mozilla/distilvit
+   results:
+   - task:
+       type: image-to-text
+       name: Image To Text
+     dataset:
+       name: nlphuji/flickr30k
+       type: nlphuji/flickr30k
+     metrics:
+     - name: ROUGE-1
+       type: rouge
+       value: 43.006
+       verified: true
+     - name: ROUGE-2
+       type: rouge
+       value: 16.9939
+       verified: true
+     - name: ROUGE-L
+       type: rouge
+       value: 38.8923
+       verified: true
+     - name: ROUGE-LSUM
+       type: rouge
+       value: 38.8877
+       verified: true
+     - name: loss
+       type: loss
+       value: 0.19939416646957397
+     - name: gen_len
+       type: gen_len
+       value: 11.327256736227712
+       verified: true
+ ---
+
+ # distilvit
+
+ This model is a work in progress. Fine-tuned version of those base models:
+
+ - a VIT model for the image encoder: https://huggingface.co/google/vit-base-patch16-224-in21k
+ - a Distilled GPT-2 model for the text decoder: https://huggingface.co/distilbert/distilgpt2
+
+ This model was trained on:
+
+ - Flickr30k : https://huggingface.co/datasets/nlphuji/flickr30k
+ - COCO 2017: https://cocodataset.org
+
+ You can get that checkpoint using the 3083a3cef6e3c8dd90df3f088074bbe836b0f403 commit.
+
+ It was then further fine-tuned on :
+
+ - Flickr30k debiased: https://huggingface.co/datasets/Mozilla/flickr30k-transformed-captions
+ - DocOrNot: https://huggingface.co/datasets/Mozilla/docornot
+
+ You can find the code used to create the model here: https://github.com/mozilla/distilvit
+
+ ### Framework versions
+
+ - Transformers 4.40.2
+ - Pytorch 2.3.0+cu121
+ - Datasets 2.19.1
+ - Tokenizers 0.19.1
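
The README in this diff says the earlier Flickr30k/COCO checkpoint can be retrieved through a specific commit, but it does not show how. The following is a minimal sketch of one way to do that, not the authors' documented usage. It assumes the commit hash refers to a revision of the `mozilla/distilvit` model repository on the Hub (the README does not say which repository), and that the model loads through the generic `image-to-text` pipeline in `transformers`. The helper names `load_captioner`, `extract_caption`, and `caption_image` are hypothetical and introduced only for illustration.

```python
from typing import Any, Optional

# Model id taken from the model-index entry in the README.
MODEL_ID = "mozilla/distilvit"

# Commit hash quoted in the README for the checkpoint trained on
# Flickr30k and COCO 2017. Assumption: it is a revision of MODEL_ID.
PRETRAINING_COMMIT = "3083a3cef6e3c8dd90df3f088074bbe836b0f403"


def load_captioner(revision: Optional[str] = None) -> Any:
    """Build an image-to-text pipeline, optionally pinned to a revision.

    The import is done lazily so this module can be loaded without
    transformers installed; calling the function downloads the weights.
    """
    from transformers import pipeline

    return pipeline("image-to-text", model=MODEL_ID, revision=revision)


def extract_caption(outputs: list) -> str:
    """Return the text of the first candidate from the pipeline output.

    For a single image, the image-to-text pipeline returns a list of
    dicts that each carry a "generated_text" key.
    """
    return outputs[0]["generated_text"].strip()


def caption_image(image: Any, revision: Optional[str] = None) -> str:
    """Caption one image, given as a local path, a URL, or a PIL image."""
    captioner = load_captioner(revision)
    return extract_caption(captioner(image))
```

Under those assumptions, `caption_image(url)` would use the latest revision, while `caption_image(url, revision=PRETRAINING_COMMIT)` would use the earlier checkpoint, where `url` could be one of the widget images listed in the front matter.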