---
license: cc-by-nc-3.0
---

# Pre-trained BERT Model for Science and Technology (KorSci BERT)

The KorSci BERT language model is one of the outcomes of a joint research project by the Korea Institute of Science and Technology Information (KISTI) and the Korea Institute of Patent Information (KIPI). It follows the architecture of the original [Google BERT base](https://github.com/google-research/bert) model and was pre-trained on a corpus of Korean papers and patents totaling 97 GB (about 380 million sentences).

## Train dataset
|Type|Corpus size|Sentences|Avg. sentence length|
|--|--|--|--|
|Papers|15 GB|72,735,757|122.11|
|Patents|82 GB|316,239,927|120.91|
|Total|97 GB|388,975,684|121.13|

## Model architecture
- attention_probs_dropout_prob: 0.1
- directionality: "bidi"
- hidden_act: "gelu"
- hidden_dropout_prob: 0.1
- hidden_size: 768
- initializer_range: 0.02
- intermediate_size: 3072
- max_position_embeddings: 512
- num_attention_heads: 12
- num_hidden_layers: 12
- pooler_fc_size: 768
- pooler_num_attention_heads: 12
- pooler_num_fc_layers: 3
- pooler_size_per_head: 128
- pooler_type: "first_token_transform"
- type_vocab_size: 2
- vocab_size: 15330
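
The hyperparameters above follow the standard `bert_config.json` layout of Google BERT, so the checkpoint can be consumed by the reference code. A minimal sketch, assuming the config is saved next to the checkpoint under the (hypothetical) file name `bert_config.json`:

	# modeling.py ships with google-research/bert (TensorFlow 1.x)
	import tensorflow as tf
	import modeling

	# Parse the architecture hyperparameters listed above
	bert_config = modeling.BertConfig.from_json_file("bert_config.json")

	# Build the encoder graph for inference
	input_ids = tf.placeholder(tf.int32, shape=[None, 128])
	model = modeling.BertModel(
	    config=bert_config,
	    is_training=False,
	    input_ids=input_ids
	)
	pooled_output = model.get_pooled_output()  # shape [batch, 768]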

## Vocabulary
- 15,330 tokens in total
- Includes the special tokens [PAD], [UNK], [CLS], [SEP], and [MASK]
- File name: vocab_kisti.txt
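
Google's reference `tokenization.py` exposes a `load_vocab` helper; assuming `tokenization_kisti` keeps it (an assumption, not confirmed by this document), the vocabulary can be inspected like this:

	import tokenization_kisti as tokenization

	# load_vocab is assumed to be carried over from Google's tokenization.py
	vocab = tokenization.load_vocab("./vocab_kisti.txt")
	print(len(vocab))  # expected: 15330, matching vocab_size above
	for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
	    print(token, vocab[token])  # each special token and its ID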

## Language model
- Model file: model.ckpt-262500 (TensorFlow checkpoint)

## Pre-training
- Trained for 1,600,000 steps at sequence length 128, followed by 500,000 steps at sequence length 512
- Trained on the 380 million sentences of the combined papers + patents corpus (97 GB)
- Distributed training across 8 NVIDIA V100 32 GB GPUs with the [Horovod library](https://github.com/horovod/horovod) (a minimal setup sketch follows this list)
- Used NVIDIA [Automatic Mixed Precision](https://developer.nvidia.com/automatic-mixed-precision)
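
For reference, the sketch below shows how Horovod data-parallel training and NVIDIA automatic mixed precision are typically wired together in TensorFlow 1.x. It is illustrative only, not the project's actual pre-training script (BERT pre-training uses its own optimizer):

	import tensorflow as tf
	import horovod.tensorflow as hvd

	hvd.init()  # one process per GPU, launched e.g. with horovodrun -np 8

	# Pin each worker process to a single GPU
	config = tf.ConfigProto()
	config.gpu_options.visible_device_list = str(hvd.local_rank())

	# Scale the learning rate by the number of workers (common heuristic)
	optimizer = tf.train.AdamOptimizer(learning_rate=1e-4 * hvd.size())

	# NVIDIA automatic mixed precision via graph rewrite (TF >= 1.14)
	optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)

	# Average gradients across workers each step
	optimizer = hvd.DistributedOptimizer(optimizer)

	# Broadcast initial variables from rank 0 so all workers start identically
	hooks = [hvd.BroadcastGlobalVariablesHook(0)]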

## Downstream task evaluation
The model was evaluated by fine-tuning it on two tasks, Korean science & technology standard classification and patent Cooperative Patent Classification ([CPC](https://www.kipo.go.kr/kpo/HtmlApp?c=4021&catmenu=m06_07_01)). The results are as follows.
|Type|Classes|Train|Test|Metric|Train result|Test result|
|--|--|--|--|--|--|--|
|Science & technology standard classification|86|130,515|14,502|Accuracy|68.21|70.31|
|Patent CPC classification|144|390,540|16,315|Accuracy|86.87|76.25|


# Science and Technology Domain Tokenizer (KorSci Tokenizer)

This tokenizer is likewise one of the outcomes of the joint research project by KISTI and KIPI. It combines the [Mecab-ko Tokenizer](https://bitbucket.org/eunjeon/mecab-ko/src/master/), extended with a user dictionary of about 6 million nouns and compound nouns built from the pre-training corpus described above, with the original [BERT WordPiece Tokenizer](https://github.com/google-research/bert).

##  ๋ชจ๋ธ ๋‹ค์šด๋กœ๋“œ
http://doi.org/10.23057/46

## Requirements

### Install eunjeon Mecab-ko & add the user dictionaries
	Installation URL: https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/
	mecab-ko > 0.996-ko-0.9.2
	mecab-ko-dic > 2.1.1
	mecab-python > 0.996-ko-0.9.2

### Paper & patent user dictionaries
- Paper user dictionary: pap_all_mecab_dic.csv (1,001,328 words)
- Patent user dictionary: pat_all_mecab_dic.csv (5,000,000 words); a registration sketch follows this list
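
A typical way to register these CSV files is mecab-ko-dic's user-dictionary workflow, sketched below under the assumption that the CSVs already follow mecab-ko-dic's user-dictionary format:

	# inside a checkout of mecab-ko-dic (see the installation URL above)
	cp pap_all_mecab_dic.csv pat_all_mecab_dic.csv user-dic/
	./tools/add-userdic.sh   # recompiles the dictionary with the user entries
	make install             # installs the compiled dictionary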

### Install konlpy
	pip install konlpy
	konlpy > 0.5.2

## Usage

	import tokenization_kisti as tokenization

	vocab_file = "./vocab_kisti.txt"

	# Mecab-ko morphological analysis first, then WordPiece on each morpheme
	tokenizer = tokenization.FullTokenizer(
	    vocab_file=vocab_file,
	    do_lower_case=False,
	    tokenizer_type="Mecab"
	)

	example = "๋ณธ ๊ณ ์•ˆ์€ ์ฃผ๋กœ ์ผํšŒ์šฉ ํ•ฉ์„ฑ์„ธ์ œ์•ก์„ ์ง‘์–ด๋„ฃ์–ด ๋ฐ€๋ด‰ํ•˜๋Š” ์„ธ์ œ์•กํฌ์˜ ๋‚ด๋ถ€๋ฅผ ์›ํ˜ธ์ƒ์œผ๋กœ ์—ด์ค‘์ฐฉํ•˜๋˜ ์„ธ์ œ์•ก์ด ๋ฐฐ์ถœ๋˜๋Š” ์ ˆ๋‹จ๋ถ€ ์ชฝ์œผ๋กœ ๋‚ด๋ฒฝ์„ ํ˜‘์†Œํ•˜๊ฒŒ ํ˜•์„ฑํ•˜์—ฌ์„œ ๋‚ด๋ถ€์— ๋“ค์–ด์žˆ๋Š” ์„ธ์ œ์•ก์„ ์ž˜์งœ์งˆ ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ํ•ฉ์„ฑ์„ธ์ œ ์•กํฌ์— ๊ด€ํ•œ ๊ฒƒ์ด๋‹ค."
	tokens = tokenizer.tokenize(example)
	encoded_tokens = tokenizer.convert_tokens_to_ids(tokens)
	decoded_tokens = tokenizer.convert_ids_to_tokens(encoded_tokens)

	print("Input example ===>", example)
	print("Tokenized example ===>", tokens)
	print("Converted example to IDs ===>", encoded_tokens)
	print("Converted IDs to example ===>", decoded_tokens)
	
	============ Result ================
	Input example ===> ๋ณธ ๊ณ ์•ˆ์€ ์ฃผ๋กœ ์ผํšŒ์šฉ ํ•ฉ์„ฑ์„ธ์ œ์•ก์„ ์ง‘์–ด๋„ฃ์–ด ๋ฐ€๋ด‰ํ•˜๋Š” ์„ธ์ œ์•กํฌ์˜ ๋‚ด๋ถ€๋ฅผ ์›ํ˜ธ์ƒ์œผ๋กœ ์—ด์ค‘์ฐฉํ•˜๋˜ ์„ธ์ œ์•ก์ด ๋ฐฐ์ถœ๋˜๋Š” ์ ˆ๋‹จ๋ถ€ ์ชฝ์œผ๋กœ ๋‚ด๋ฒฝ์„ ํ˜‘์†Œํ•˜๊ฒŒ ํ˜•์„ฑํ•˜์—ฌ์„œ ๋‚ด๋ถ€์— ๋“ค์–ด์žˆ๋Š” ์„ธ์ œ์•ก์„ ์ž˜์งœ์งˆ ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ํ•ฉ์„ฑ์„ธ์ œ ์•กํฌ์— ๊ด€ํ•œ ๊ฒƒ์ด๋‹ค.
	Tokenized example ===> ['๋ณธ', '๊ณ ์•ˆ', '์€', '์ฃผ๋กœ', '์ผํšŒ์šฉ', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '##์•ก', '์„', '์ง‘', '##์–ด', '##๋„ฃ', '์–ด', '๋ฐ€๋ด‰', 'ํ•˜', '๋Š”', '์„ธ์ œ', '##์•ก', '##ํฌ', '์˜', '๋‚ด๋ถ€', '๋ฅผ', '์›ํ˜ธ', '์ƒ', '์œผ๋กœ', '์—ด', '##์ค‘', '์ฐฉ', '##ํ•˜', '๋˜', '์„ธ์ œ', '##์•ก', '์ด', '๋ฐฐ์ถœ', '๋˜', '๋Š”', '์ ˆ๋‹จ๋ถ€', '์ชฝ', '์œผ๋กœ', '๋‚ด๋ฒฝ', '์„', 'ํ˜‘', '##์†Œ', 'ํ•˜', '๊ฒŒ', 'ํ˜•์„ฑ', 'ํ•˜', '์—ฌ์„œ', '๋‚ด๋ถ€', '์—', '๋“ค', '์–ด', '์žˆ', '๋Š”', '์„ธ์ œ', '##์•ก', '์„', '์ž˜', '์งœ', '์งˆ', '์ˆ˜', '์žˆ', '๋„๋ก', 'ํ•˜', '๋Š”', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '์•ก', '##ํฌ', '์—', '๊ด€ํ•œ', '๊ฒƒ', '์ด', '๋‹ค', '.']
	Converted example to IDs ===> [59, 619, 30, 2336, 8268, 819, 14100, 13986, 14198, 15, 732, 13994, 14615, 39, 1964, 12, 11, 6174, 14198, 14061, 9, 366, 16, 7267, 18, 32, 307, 14072, 891, 13967, 27, 6174, 14198, 14, 698, 27, 11, 12920, 1972, 32, 4482, 15, 2228, 14053, 12, 65, 117, 12, 4477, 366, 10, 56, 39, 26, 11, 6174, 14198, 15, 1637, 13709, 398, 25, 26, 140, 12, 11, 819, 14100, 13986, 377, 14061, 10, 487, 55, 14, 17, 13]
	Converted IDs to example ===> ['๋ณธ', '๊ณ ์•ˆ', '์€', '์ฃผ๋กœ', '์ผํšŒ์šฉ', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '##์•ก', '์„', '์ง‘', '##์–ด', '##๋„ฃ', '์–ด', '๋ฐ€๋ด‰', 'ํ•˜', '๋Š”', '์„ธ์ œ', '##์•ก', '##ํฌ', '์˜', '๋‚ด๋ถ€', '๋ฅผ', '์›ํ˜ธ', '์ƒ', '์œผ๋กœ', '์—ด', '##์ค‘', '์ฐฉ', '##ํ•˜', '๋˜', '์„ธ์ œ', '##์•ก', '์ด', '๋ฐฐ์ถœ', '๋˜', '๋Š”', '์ ˆ๋‹จ๋ถ€', '์ชฝ', '์œผ๋กœ', '๋‚ด๋ฒฝ', '์„', 'ํ˜‘', '##์†Œ', 'ํ•˜', '๊ฒŒ', 'ํ˜•์„ฑ', 'ํ•˜', '์—ฌ์„œ', '๋‚ด๋ถ€', '์—', '๋“ค', '์–ด', '์žˆ', '๋Š”', '์„ธ์ œ', '##์•ก', '์„', '์ž˜', '์งœ', '์งˆ', '์ˆ˜', '์žˆ', '๋„๋ก', 'ํ•˜', '๋Š”', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '์•ก', '##ํฌ', '์—', '๊ด€ํ•œ', '๊ฒƒ', '์ด', '๋‹ค', '.']
	
	
### Fine-tuning with KorSci-Bert
- Follow the fine-tuning instructions of [Google BERT](https://github.com/google-research/bert)
- Sentence (and sentence-pair) classification tasks: use the "run_classifier.py" script (an example command follows this list)
- MRC (Machine Reading Comprehension) tasks: use the "run_squad.py" script
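
For instance, a classification fine-tuning run using Google BERT's standard run_classifier.py flags might look like the sketch below. The task name, data directory, output directory, and the `bert_config.json` file name are placeholders; a Korean dataset would also need its own DataProcessor registered in run_classifier.py:

	python run_classifier.py \
	  --task_name=cola \
	  --do_train=true \
	  --do_eval=true \
	  --data_dir=./my_task_data \
	  --vocab_file=./vocab_kisti.txt \
	  --bert_config_file=./bert_config.json \
	  --init_checkpoint=./model.ckpt-262500 \
	  --max_seq_length=128 \
	  --train_batch_size=32 \
	  --learning_rate=2e-5 \
	  --num_train_epochs=3.0 \
	  --output_dir=./output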