BM-K committed on
Commit
8475d4d
•
1 Parent(s): 5fe91b0

Create README.md

Files changed (1)
  1. README.md +140 -0
README.md ADDED
@@ -0,0 +1,140 @@
+ # KoDiffCSE
+ Difference-based Contrastive Learning for Korean Sentence Embeddings <br>
+ - [DiffCSE-[NAACL 2022]](https://arxiv.org/abs/2204.10298) <br>
+ - [[Github]](https://github.com/voidism/DiffCSE) Official implementation of DiffCSE <br>
+ <img src=https://user-images.githubusercontent.com/55969260/201829550-9674a3ac-cb9b-4e17-b777-7d96fdf5c633.png>
+
+ ## Quick tour
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ def cal_score(a, b):
+     # Cosine similarity between the rows of a and b, scaled by 100.
+     if len(a.shape) == 1: a = a.unsqueeze(0)
+     if len(b.shape) == 1: b = b.unsqueeze(0)
+
+     a_norm = a / a.norm(dim=1)[:, None]
+     b_norm = b / b.norm(dim=1)[:, None]
+     return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100
+
+ model = AutoModel.from_pretrained('BM-K/KoDiffCSE-RoBERTa')
+ tokenizer = AutoTokenizer.from_pretrained('BM-K/KoDiffCSE-RoBERTa')
+
+ sentences = ['치타가 들판을 가로 질러 먹이를 쫓는다.',  # "A cheetah chases its prey across a field."
+              '치타 한 마리가 먹이 뒤에서 달리고 있다.',  # "A cheetah is running behind its prey."
+              '원숭이 한 마리가 드럼을 연주한다.']  # "A monkey is playing the drums."
+
+ inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
+ # With return_dict=False the first output is the last hidden state: (batch, seq_len, hidden).
+ embeddings, _ = model(**inputs, return_dict=False)
+
+ # embeddings[i][0] is the first-token vector of sentence i.
+ score01 = cal_score(embeddings[0][0], embeddings[1][0]) # --
+ # '치타가 들판을 가로 질러 먹이를 쫓는다.' @ '치타 한 마리가 먹이 뒤에서 달리고 있다.'
+ score02 = cal_score(embeddings[0][0], embeddings[2][0]) # --
+ # '치타가 들판을 가로 질러 먹이를 쫓는다.' @ '원숭이 한 마리가 드럼을 연주한다.'
+ ```
+
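The quick tour scores one sentence pair at a time. As a minimal sketch, assuming a random tensor as a stand-in for the model output (so nothing has to be downloaded), the same `cal_score` function can also produce the full pairwise score matrix from the first-token vectors in one call. The comparison against `torch.nn.functional.cosine_similarity` is only a sanity check and is not part of the original README.

```python
import torch
import torch.nn.functional as F

def cal_score(a, b):
    # Same computation as in the quick tour: cosine similarity scaled by 100.
    if len(a.shape) == 1: a = a.unsqueeze(0)
    if len(b.shape) == 1: b = b.unsqueeze(0)
    a_norm = a / a.norm(dim=1)[:, None]
    b_norm = b / b.norm(dim=1)[:, None]
    return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100

torch.manual_seed(0)
# Hypothetical stand-in for `embeddings`: (batch=3, seq_len=5, hidden=768).
hidden = torch.randn(3, 5, 768)

# First-token vector of every sentence, i.e. embeddings[i][0] for all i.
first_token = hidden[:, 0]                      # (3, 768)

# Entry [i, j] is the score between sentence i and sentence j.
scores = cal_score(first_token, first_token)    # (3, 3)

# Cross-check against PyTorch's built-in cosine similarity.
reference = F.cosine_similarity(
    first_token.unsqueeze(1), first_token.unsqueeze(0), dim=-1
) * 100

print(scores.shape)
print(torch.allclose(scores, reference, atol=1e-2))
```

With real model outputs, `scores[0, 1]` and `scores[0, 2]` would play the roles of `score01` and `score02` above.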
+ ## Setups
+ [![Python](https://img.shields.io/badge/python-3.8.5-blue?logo=python&logoColor=FED643)](https://www.python.org/downloads/release/python-385/)
+ [![Pytorch](https://img.shields.io/badge/pytorch-1.7.1-red?logo=pytorch)](https://pytorch.org/get-started/previous-versions/)
+
+ ## Encoder Models
+ Baseline encoders used for Korean sentence embedding - [KLUE-PLMs](https://github.com/KLUE-benchmark/KLUE/blob/main/README.md)
+
+ | Model | Embedding size | Hidden size | # Layers | # Heads |
+ |----------------------|----------------|-------------|----------|---------|
+ | KLUE-BERT-base | 768 | 768 | 12 | 12 |
+ | KLUE-RoBERTa-base | 768 | 768 | 12 | 12 |
+
+ > **Warning** <br>
+ > Training large pre-trained models requires a lot of GPU memory.
+
+ ## Datasets
+ The data files must be placed in the folder passed via `--path_to_data`.
+ - [wiki-corpus](https://github.com/jeongukjae/korean-wikipedia-corpus) (Unsupervised Training)
+ - [KorSTS](https://github.com/kakaobrain/KorNLUDatasets) (Validation & Testing)
+
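Before launching a run, it can help to confirm that the data folder actually holds the files the commands in this README refer to. The sketch below is not part of the repository: the helper name `missing_data_files` is hypothetical, and the file names are taken from the `--train_data`, `--valid_data`, and `--test_data` arguments shown in the training and evaluation commands. The demo runs against a temporary directory rather than a real `Dataset/` folder.

```python
import tempfile
from pathlib import Path

# File names as they appear in the training and evaluation commands of this README.
EXPECTED_FILES = ["wiki_corpus_examples.txt", "valid_sts.tsv", "test_sts.tsv"]

def missing_data_files(path_to_data, filenames=EXPECTED_FILES):
    """Return the expected file names that are not present in path_to_data."""
    root = Path(path_to_data)
    return [name for name in filenames if not (root / name).is_file()]

# Demo in a throwaway directory that contains only the validation file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "valid_sts.tsv").write_text("placeholder\n", encoding="utf-8")
    demo_missing = missing_data_files(tmp)

print(demo_missing)  # ['wiki_corpus_examples.txt', 'test_sts.tsv']
```

In practice one would call `missing_data_files("Dataset/")` and stop early if the returned list is not empty.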
+ ## Training - unsupervised
+ ```
+ python main.py \
+     --model klue/roberta-base \
+     --generator_name klue/roberta-small \
+     --multi_gpu True \
+     --train True \
+     --test False \
+     --max_len 64 \
+     --batch_size 256 \
+     --epochs 1 \
+     --eval_steps 125 \
+     --lr 0.00005 \
+     --masking_ratio 0.15 \
+     --lambda_weight 0.005 \
+     --warmup_ratio 0.05 \
+     --temperature 0.05 \
+     --path_to_data Dataset/ \
+     --train_data wiki_corpus_examples.txt \
+     --valid_data valid_sts.tsv \
+     --ckpt best_checkpoint.pt
+ ```
+ ```
+ bash run_diff.sh
+ ```
+ > **Note** <br>
+ > Using RoBERTa as the encoder is beneficial for training, because a small-sized generator cannot be built for the KoBERT model.
+
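In the DiffCSE paper, the training objective is a contrastive loss plus a replaced-token-detection (RTD) loss scaled by a weight λ. The `--temperature` and `--lambda_weight` flags above presumably map onto those two quantities. The following is a minimal PyTorch sketch of that combination, assuming in-batch negatives and two embedding views of each sentence. The function names `contrastive_loss` and `total_loss` are illustrative and are not taken from this repository's `main.py`.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.05):
    """InfoNCE over a batch: z1[i] and z2[i] are two views of sentence i."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Cosine similarity of every z1 row with every z2 row, sharpened by the temperature.
    logits = z1 @ z2.T / temperature                     # (batch, batch)
    # The matching view sits on the diagonal; all other rows act as negatives.
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def total_loss(loss_contrastive, loss_rtd, lambda_weight=0.005):
    """Contrastive term plus the weighted RTD term."""
    return loss_contrastive + lambda_weight * loss_rtd

torch.manual_seed(0)
view_a = torch.randn(8, 768)                    # hypothetical sentence embeddings
view_b = view_a + 0.01 * torch.randn(8, 768)    # slightly perturbed second view

loss_c = contrastive_loss(view_a, view_b)
loss = total_loss(loss_c, torch.tensor(0.7))    # 0.7 is a made-up RTD loss value
print(loss_c.item(), loss.item())
```

A low temperature such as 0.05 makes the softmax sharply favor the most similar pair, and a small λ keeps the RTD term from dominating the sentence-level objective.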
+ ## Evaluation
+ ```
+ python main.py \
+     --model klue/roberta-base \
+     --generator klue/roberta-small \
+     --train False \
+     --test True \
+     --max_len 64 \
+     --batch_size 256 \
+     --path_to_data Dataset/ \
+     --test_data test_sts.tsv \
+     --path_to_saved_model output/best_checkpoint.pt
+ ```
+
+ ## Performance - unsupervised
+
+ | Model | Average | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
+ |------------------------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
+ | KoSRoBERTa-base<sup>†</sup> | N/A | N/A | 48.96 | N/A | N/A | N/A | N/A | N/A | N/A |
+ | KoSRoBERTa-large<sup>†</sup> | N/A | N/A | 51.35 | N/A | N/A | N/A | N/A | N/A | N/A |
+ | | | | | | | | | | |
+ | KoSimCSE-BERT | 74.08 | 74.92 | 73.98 | 74.15 | 74.22 | 74.07 | 74.07 | 74.15 | 73.14 |
+ | KoSimCSE-RoBERTa | 75.27 | 75.93 | 75.00 | 75.28 | 75.01 | 75.17 | 74.83 | 75.95 | 75.01 |
+ | | | | | | | | | | |
+ | KoDiffCSE-RoBERTa | 77.17 | 77.73 | 76.96 | 77.21 | 76.89 | 77.11 | 76.81 | 77.74 | 76.97 |
+
+ - [Korean-SRoBERTa<sup>†</sup>](https://arxiv.org/abs/2004.03289)
+
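The Pearson and Spearman columns above measure how well model similarity scores track the gold KorSTS scores. The following is a standard-library sketch of both correlations, assuming the table reports them multiplied by 100. The helper names and the toy numbers are illustrative only and are not results from this repository.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data: made-up model similarities and gold scores for four sentence pairs.
predicted = [0.91, 0.35, 0.62, 0.10]
gold = [4.8, 1.5, 3.9, 0.2]

print(round(pearson(predicted, gold) * 100, 2))
print(round(spearman(predicted, gold) * 100, 2))
```

Spearman depends only on the ordering of the scores, which is why it is reported separately from Pearson for each similarity function (cosine, Euclidean, Manhattan, dot product).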
+ ## License
+ This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
+
+ <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />
+
+ ## References
+
+ ```bibtex
+ @inproceedings{chuang2022diffcse,
+   title={{DiffCSE}: Difference-based Contrastive Learning for Sentence Embeddings},
+   author={Chuang, Yung-Sung and Dangovski, Rumen and Luo, Hongyin and Zhang, Yang and Chang, Shiyu and Soljacic, Marin and Li, Shang-Wen and Yih, Wen-tau and Kim, Yoon and Glass, James},
+   booktitle={Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
+   year={2022}
+ }
+ @misc{park2021klue,
+   title={KLUE: Korean Language Understanding Evaluation},
+   author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
+   year={2021},
+   eprint={2105.09680},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ @article{ham2020kornli,
+   title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
+   author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
+   journal={arXiv preprint arXiv:2004.03289},
+   year={2020}
+ }
+ ```