beomi committed
Commit 4431b6c • 1 Parent(s): 42204d4

Update README.md

Files changed (1)
  1. README.md +248 -3
README.md CHANGED
@@ -1,3 +1,248 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - ko
+ - en
+ tags:
+ - electra
+ - korean
+ license: "mit"
+ ---
+
+
+ # KcELECTRA: Korean comments ELECTRA
+
+ **Updates on 2022.10.08**
+
+ - The model has been renamed to KcELECTRA-base-v2022 (formerly v2022-dev).
+ - Detailed scores for this model have been added.
+ - Compared with the previous KcELECTRA-base (v2021), it improves by about 1%p on most downstream tasks.
+
+ ---
+
+ Most publicly released Korean Transformer models are trained on well-curated data such as Korean Wikipedia, news articles, and books. In contrast, user-generated noisy-text datasets such as NSMC are uncurated, colloquial, and full of neologisms, and they frequently contain typos and other expressions that do not appear in formal writing.
+
+ KcELECTRA is a pretrained ELECTRA model built for datasets with these characteristics: comments and replies were collected from Naver News, and both the tokenizer and the ELECTRA model were trained from scratch.
+
+ Compared with the earlier KcBERT, performance improved substantially thanks to a larger dataset and an expanded vocab.
+
+ KcELECTRA can be loaded easily through Huggingface's Transformers library. (No separate file download is needed.)
+
+ ```
+ 💡 NOTE 💡
+ KoELECTRA, which is trained on a general corpus, is likely to perform better on general-purpose tasks.
+ KcBERT/KcELECTRA are PLMs that work better on user-generated, noisy text.
+ ```
+
+ ## KcELECTRA Performance
+
+ - The finetune code is available at https://github.com/Beomi/KcBERT-finetune.
+ - Detailed per-step scores can be found in each checkpoint folder of that repo.
+
+ | | Size | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(spearman) | **Question Pair**<br/>(acc) | **KorQuaD (Dev)**<br/>(EM/F1) |
+ | :----------------- | :-------------: | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :---------------------------: |
+ | **KcELECTRA-base-v2022** | 475M | **91.97** | 87.35 | 76.50 | 82.12 | 83.67 | 95.12 | 69.00 / 90.40 |
+ | **KcELECTRA-base** | 475M | 91.71 | 86.90 | 74.80 | 81.65 | 82.65 | **95.78** | 70.60 / 90.11 |
+ | KcBERT-Base | 417M | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 |
+ | KcBERT-Large | 1.2G | 90.68 | 85.53 | 70.15 | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 |
+ | KoBERT | 351M | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 |
+ | XLM-Roberta-Base | 1.03G | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 |
+ | HanBERT | 614M | 90.16 | 87.31 | 82.40 | 80.89 | 83.33 | 94.19 | 78.74 / 92.02 |
+ | KoELECTRA-Base | 423M | 90.21 | 86.87 | 81.90 | 80.85 | 83.21 | 94.20 | 61.10 / 89.59 |
+ | KoELECTRA-Base-v2 | 423M | 89.70 | 87.02 | 83.90 | 80.61 | 84.30 | 94.72 | 84.34 / 92.58 |
+ | KoELECTRA-Base-v3 | 423M | 90.63 | **88.11** | **84.45** | **82.24** | **85.53** | 95.25 | **84.83 / 93.45** |
+ | DistilKoBERT | 108M | 88.41 | 84.13 | 62.55 | 70.55 | 73.21 | 92.48 | 54.12 / 77.80 |
+
+
+ \*HanBERT's size is the combined size of the BERT model and the tokenizer DB.
+
+ \***These results were obtained with the config settings left as-is; additional hyperparameter tuning may yield better performance.**
+
+ ## How to use
+
+ ### Requirements
+
+ - `pytorch ~= 1.8.0`
+ - `transformers ~= 4.11.3`
+ - `emoji ~= 0.6.0`
+ - `soynlp ~= 0.0.493`
+
+ ### Default usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base")
+ model = AutoModel.from_pretrained("beomi/KcELECTRA-base")
+ ```
+
+ > 💡 If your existing KcBERT code uses `AutoTokenizer` and `AutoModel`, just change `.from_pretrained("beomi/kcbert-base")` to `.from_pretrained("beomi/KcELECTRA-base")` and it works right away.
+
+ ### Pretrain & Finetune Colab links
+
+ #### Pretrain Data
+
+ - The data used to train KcBERT, plus comments collected afterwards through early March 2021
+   - About 17GB
+   - Documents are built by grouping each comment with its replies
+
+ #### Pretrain Code
+
+ - Pretrained with the https://github.com/KLUE-benchmark/KLUE-ELECTRA repo
+
+ #### Finetune Code
+
+ - Finetuning and score comparison with the https://github.com/Beomi/KcBERT-finetune repo
+
+ #### Finetune Samples
+
+ - NSMC with PyTorch-Lightning 1.3.0, GPU, Colab <a href="https://colab.research.google.com/drive/1Hh63kIBAiBw3Hho--BvfdUWLu-ysMFF0?usp=sharing">
+ <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+ </a>
+
+
+ ## Train Data & Preprocessing
+
+ ### Raw Data
+
+ The training data consists of all **comments and replies** on **heavily commented news articles, or on all news articles**, written between 2019.01.01 and 2021.03.09.
+
+ After extracting the text only, the data is **about 17.3GB and contains more than 180 million sentences**.
+
+ > KcBERT was trained on text from 2019.01-2020.06, about 90 million sentences after cleaning.
+
+ ### Preprocessing
+
+ The preprocessing applied for PLM training was as follows.
+
+ 1. Hangul, English, special characters, and even emoji (🥳)!
+
+    Using a regular expression, Hangul, English, special characters, and emoji were all kept as training targets.
+
+    The Hangul range was specified as `ㄱ-ㅎ가-힣`, which excludes the Hanja that fall inside `ㄱ-힣`.
+
+ 2. Collapsing repeated characters within comments
+
+    Repeated characters such as `ㅋㅋㅋㅋㅋ` were collapsed to a form like `ㅋㅋ`.
+
+ 3. Cased Model
+
+    KcBERT is a cased model that preserves upper and lower case for English text.
+
+ 4. Removing texts of 10 characters or fewer
+
+    Texts shorter than 10 characters often consist of a single word, so they were excluded.
+
+ 5. Deduplication
+
+    To remove comments that were posted repeatedly, exactly matching duplicate comments were merged into one.
+
+ 6. Removing `OOO`
+
+    Naver's own filter masks profanity in comments as `OOO`. This placeholder was replaced with whitespace.
+
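Steps 1, 2, 4, 5, and 6 above can be sketched with the standard library alone. This is a minimal illustration, not the author's pipeline: the helper names (`collapse_repeats`, `remove_masked_profanity`, `preprocess`), the order of the steps, and the `min_chars=10` cut-off are assumptions, and `collapse_repeats` is a simplified stand-in for `soynlp`'s `repeat_normalize`.

```python
import re

# `ㄱ-힣` spans U+3131..U+D7A3, a range that also contains CJK ideographs (Hanja).
# `ㄱ-ㅎ가-힣` is limited to Hangul consonant jamo and Hangul syllables.
wide = re.compile("[ㄱ-힣]")
narrow = re.compile("[ㄱ-ㅎ가-힣]")


def collapse_repeats(text, max_repeats=2):
    """Shorten any character repeated more than `max_repeats` times in a row."""
    return re.sub(r"(.)\1{%d,}" % max_repeats,
                  lambda m: m.group(1) * max_repeats, text)


def remove_masked_profanity(text):
    """Replace the `OOO` placeholder with a space."""
    return text.replace("OOO", " ")


def preprocess(comments, min_chars=10):
    """Hypothetical pipeline: clean each comment, drop short ones, dedupe exact matches."""
    seen = set()
    kept = []
    for raw in comments:
        text = collapse_repeats(remove_masked_profanity(raw)).strip()
        if len(text) < min_chars:  # very short texts are mostly single words
            continue
        if text in seen:  # exact-match deduplication
            continue
        seen.add(text)
        kept.append(text)
    return kept


sample = [
    "이 영화 정말 재밌어요 ㅋㅋㅋㅋㅋ",
    "이 영화 정말 재밌어요 ㅋㅋㅋㅋㅋ",  # exact duplicate
    "짧음",  # too short
    "OOO 같은 소리 하지 마세요 제발",
]
print(preprocess(sample))
print(bool(wide.search("漢")), bool(narrow.search("漢")))
```

The last line shows why the narrower range matters: a Hanja character matches `[ㄱ-힣]` but not `[ㄱ-ㅎ가-힣]`.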
+ After installing the packages with pip using the command below, cleaning text with the `clean` function below improves downstream task performance (fewer `[UNK]` tokens).
+
+ ```bash
+ pip install soynlp emoji
+ ```
+
+ Apply the `clean` function below to your text data.
+
+ ```python
+ import re
+
+ import emoji
+ from soynlp.normalizer import repeat_normalize
+
+ # Keep basic punctuation, ASCII, Hangul jamo (ㄱ-ㅣ), and Hangul syllables (가-힣);
+ # every other run of characters is replaced with a single space.
+ pattern = re.compile('[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
+ url_pattern = re.compile(
+     r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')
+
+
+ def clean(x):
+     x = pattern.sub(' ', x)
+     # Note: `emoji.replace_emoji` is not available in emoji 0.6.0; it needs a newer release.
+     x = emoji.replace_emoji(x, replace='')  # remove emoji
+     x = url_pattern.sub('', x)
+     x = x.strip()
+     x = repeat_normalize(x, num_repeats=2)  # e.g. ㅋㅋㅋㅋㅋ -> ㅋㅋ
+     return x
+ ```
+
+ > 💡 The finetune scores were obtained without applying the `clean` function above.
+
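To see what the character whitelist and the URL pattern do without installing `soynlp` or `emoji`, the two regular expressions can be tried on their own. The `clean_basic` name is hypothetical, and the function deliberately omits emoji handling via the `emoji` package and repeat normalization, so it is only a partial approximation of `clean`.

```python
import re

# Same two patterns as in `clean`, copied here so the snippet is self-contained.
pattern = re.compile('[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣]+')
url_pattern = re.compile(
    r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')


def clean_basic(x):
    """Hypothetical, stdlib-only subset of `clean`: whitelist filter, then URL removal."""
    x = pattern.sub(' ', x)     # Hanja, emoji, and other symbols become a space
    x = url_pattern.sub('', x)  # URLs are ASCII, so they survive the first step
    return x.strip()


text = "漢字 좋아요 😀 ㅋㅋ https://example.com/a?b=1 ok!"
print(clean_basic(text).split())
```

Because each removed run leaves a space behind, the result can contain consecutive spaces; splitting on whitespace, as in the last line, is one way to inspect the surviving tokens.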
+ ### Cleaned Data
+
+ - Additional data beyond the KcBERT data will be released after it has been organized.
+
+
+ ## Tokenizer, Model Train
+
+ The tokenizer was trained with Huggingface's [Tokenizers](https://github.com/huggingface/tokenizers) library.
+
+ Specifically, `BertWordPieceTokenizer` was used, with a vocab size of `30000`.
+
+ The tokenizer was trained on the full dataset. To cover general downstream tasks, the entries of the KoELECTRA vocab that did not overlap with it were additionally added. (In practice, the overlap between the two vocabs was about 5,000 tokens.)
+
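The vocab extension described above amounts to appending the tokens that one vocab lacks from another. The sketch below shows that idea on toy data; the `merge_vocab` helper and both token lists are made up for illustration, and the exact procedure used for KcELECTRA is not documented here.

```python
def merge_vocab(base_vocab, extra_vocab):
    """Append tokens from `extra_vocab` that are missing from `base_vocab`, keeping order."""
    seen = set(base_vocab)
    merged = list(base_vocab)
    for token in extra_vocab:
        if token not in seen:
            seen.add(token)
            merged.append(token)
    return merged


# Toy stand-ins for a comment-trained vocab and a general-corpus vocab.
comment_vocab = ["[PAD]", "[UNK]", "ㅋㅋ", "##요", "영화"]
general_vocab = ["[PAD]", "[UNK]", "영화", "경제", "##니다"]

merged = merge_vocab(comment_vocab, general_vocab)
overlap = set(comment_vocab) & set(general_vocab)
print(len(merged), len(overlap))
```

Existing token ids stay stable because new tokens are only appended, which is one reason to merge this way rather than re-sorting the combined list.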
+ Training ran for about 10 days on a TPU `v3-8`; the weights currently published on Huggingface are from the model trained for 848k steps.
+
+ (Performance was evaluated on checkpoints taken every 100k steps. See the `KcBERT-finetune` repo for details.)
+
+ The training loss drops sharply during the first 100-200k steps and then keeps decreasing steadily until the end of training.
+
+ ![KcELECTRA-base Pretrain Loss](https://cdn.jsdelivr.net/gh/beomi/blog-img@master/2021/04/07/image-20210407201231133.png)
+
+ ### Downstream task performance by KcELECTRA pretrain step
+
+ > 💡 The table below covers only a subset of the checkpoints, not all of them.
+
+ ![Downstream task performance by KcELECTRA pretrain step](https://cdn.jsdelivr.net/gh/beomi/blog-img@master/2021/04/07/image-20210407215557039.png)
+
+ - As shown above, KcELECTRA-base outperforms KcBERT-base and KcBERT-large **on every dataset**.
+ - Within KcELECTRA pretraining as well, performance improves gradually as the number of train steps increases.
+
+ ## Citation
+
+ When citing KcELECTRA, please use the format below.
+
+ ```
+ @misc{lee2021kcelectra,
+ author = {Junbum Lee},
+ title = {KcELECTRA: Korean comments ELECTRA},
+ year = {2021},
+ publisher = {GitHub},
+ journal = {GitHub repository},
+ howpublished = {\url{https://github.com/Beomi/KcELECTRA}}
+ }
+ ```
+
+ For uses other than citation in a paper, please include the MIT license notice. ☺️
+
+ ## Acknowledgement
+
+ The GCP/TPU environment used to train the KcELECTRA model was supported by the [TFRC](https://www.tensorflow.org/tfrc?hl=ko) program.
+
+ Thanks to [Monologg](https://github.com/monologg/) for the generous advice during model training :)
+
+ ## Reference
+
+ ### Github Repos
+
+ - [KcBERT by Beomi](https://github.com/Beomi/KcBERT)
+ - [BERT by Google](https://github.com/google-research/bert)
+ - [KoBERT by SKT](https://github.com/SKTBrain/KoBERT)
+ - [KoELECTRA by Monologg](https://github.com/monologg/KoELECTRA/)
+ - [Transformers by Huggingface](https://github.com/huggingface/transformers)
+ - [Tokenizers by Huggingface](https://github.com/huggingface/tokenizers)
+ - [ELECTRA train code by KLUE](https://github.com/KLUE-benchmark/KLUE-ELECTRA)
+
+ ### Blogs
+
+ - [Monologg's notes on training KoELECTRA](https://monologg.kr/categories/NLP/ELECTRA/)
+ - [Training BERT from scratch on a Colab TPU - Tensorflow/Google ver.](https://beomi.github.io/2020/02/26/Train-BERT-from-scratch-on-colab-TPU-Tensorflow-ver/)