---
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
license: mit
pipeline_tag: feature-extraction
---
[xlm-roberta-base](https://huggingface.co/xlm-roberta-base) fine-tuned for sentence embeddings with [SimCSE](http://dx.doi.org/10.18653/v1/2021.emnlp-main.552) (Gao et al., EMNLP 2021).
See a similar English model released by Gao et al.: https://huggingface.co/princeton-nlp/unsup-simcse-roberta-base.
Fine-tuning was done using the [reference implementation of unsupervised SimCSE](https://github.com/princeton-nlp/SimCSE) and the 1M sentences from English Wikipedia released by the authors.
As a sentence representation, we used the average of the last hidden states (`pooler_type=avg`), which is compatible with Sentence-BERT.
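For reference, here is a minimal sketch of how such sentence embeddings can be computed with 🤗 Transformers using the same mean pooling over the last hidden states. The model id below is an assumption based on the `--output_dir` above; replace it with the actual repository id if it differs.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "unsup-simcse-xlm-roberta-base"  # assumed id, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["Ein Beispielsatz.", "An example sentence."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Average the token embeddings, ignoring padding positions (pooler_type=avg).
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings.
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```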
Fine-tuning command:
```bash
python train.py \
--model_name_or_path xlm-roberta-base \
--train_file data/wiki1m_for_simcse.txt \
--output_dir unsup-simcse-xlm-roberta-base \
--num_train_epochs 1 \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 16 \
--learning_rate 1e-5 \
--max_seq_length 128 \
--pooler_type avg \
--overwrite_output_dir \
--temp 0.05 \
--do_train \
--fp16 \
--seed 28852
```
## [Citation](https://arxiv.org/abs/2305.13303)
```bibtex
@inproceedings{vamvas-sennrich-2023-rsd,
    title = "Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents",
    author = "Vamvas, Jannis and Sennrich, Rico",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
}
```