---
license: apache-2.0
language:
- en
pipeline_tag: image-to-text
datasets:
- MS-COCO
- Flickr30k
tags:
- Image Captioning
---

# CapDec - NoiseLevel: 0.001

## Model Description

These are model weights originally provided by the authors of the paper [Text-Only Training for Image Captioning using Noise-Injected CLIP](https://arxiv.org/pdf/2211.00575.pdf).
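
Since the card does not include usage instructions, the checkpoint can presumably be fetched with `huggingface_hub`, the library used to upload it. A minimal sketch, where `repo_id` and `filename` are placeholders rather than confirmed identifiers:

```python
from huggingface_hub import hf_hub_download

# Placeholder identifiers: substitute the actual repository id and
# checkpoint filename of this model.
checkpoint_path = hf_hub_download(
    repo_id="<user>/<repo>",
    filename="<checkpoint>.pt",
)
print(checkpoint_path)  # local path to the downloaded weights
```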

Their method aims to train an image captioning model using only text samples. To this end, they inject zero-mean Gaussian noise into the CLIP text embeddings before decoding.

In their words:
*Specifically, we assume that the visual embedding corresponding to a text embedding lies somewhere within a ball of small radius around the text embedding (see Fig. 1). We would like all text embeddings in this ball to decode to the same caption, which should also correspond to the visual content mapped to this ball. We implement this intuition by adding zero-mean Gaussian noise of STD to the text embedding before decoding it.*

The "Noise Level" of 0.001 is the noise variance, which is the square of the STD (so the STD here is √0.001 ≈ 0.0316).
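
As a concrete illustration, here is a minimal sketch of the noise-injection step in PyTorch; the function name `inject_noise` is illustrative and not taken from the authors' code:

```python
import torch

def inject_noise(text_embeddings: torch.Tensor, variance: float = 0.001) -> torch.Tensor:
    """Add zero-mean Gaussian noise to CLIP text embeddings before decoding.

    The noise level is the variance, so the STD is its square root
    (sqrt(0.001) ≈ 0.0316 for this checkpoint).
    """
    std = variance ** 0.5
    return text_embeddings + torch.randn_like(text_embeddings) * std
```

During training, the caption decoder would then receive this noisy embedding instead of the clean text embedding.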

The metrics reported in the paper come from a model trained with a noise variance of 0.016, a checkpoint which the authors unfortunately do not provide in their repository.

## Datasets
The authors trained the model on the MS-COCO and Flickr30k datasets.

## Performance
The authors don't explicitly report the performance for this noise level, but it can be estimated from the following figure from the original paper:

![](capdec_performance.png)