---
language:
  - multilingual
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - id
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - xh
  - yi
  - zh
license: mit
pipeline_tag: feature-extraction
---

[xlm-roberta-base](https://huggingface.co/xlm-roberta-base) fine-tuned for sentence embeddings with [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021).

See a similar English model released by Gao et al.: https://huggingface.co/princeton-nlp/unsup-simcse-roberta-base.

Fine-tuning was done using the [reference implementation of unsupervised SimCSE](https://github.com/princeton-nlp/SimCSE) and the 1M sentences from English Wikipedia released by the authors.
As a sentence representation, we used the average of the last hidden states (`pooler_type=avg`), which is compatible with Sentence-BERT.
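
A minimal sketch of this average pooling with the `transformers` library (the checkpoint path is a placeholder; substitute the fine-tuned model, e.g. the `unsup-simcse-xlm-roberta-base` output directory produced by the command below):

```python
# Sketch: mean pooling over the last hidden states (pooler_type=avg).
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "unsup-simcse-xlm-roberta-base"  # placeholder for the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
model.eval()

sentences = ["Hello, world!", "Hallo, Welt!"]
batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, hidden)

# Average the hidden states over non-padding tokens.
mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Example: cosine similarity between the two sentence embeddings.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(sim.item())
```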

Fine-tuning command:
```bash
python train.py \
    --model_name_or_path xlm-roberta-base \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir unsup-simcse-xlm-roberta-base \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-5 \
    --max_seq_length 128 \
    --pooler_type avg \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --fp16 \
    --seed 28852
```

## [Citation](https://arxiv.org/abs/2305.13303)
```bibtex
@inproceedings{vamvas-sennrich-2023-rsd,
    title = "Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents",
    author = "Vamvas, Jannis and Sennrich, Rico",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
}
```