---
license: cc-by-nc-3.0
---

# Pre-trained BERT for Science and Technology (KorSci BERT)
The KorSci BERT language model is one of the outputs of a research project conducted jointly by the Korea Institute of Science and Technology Information (KISTI) and the Korea Institute of Patent Information (KIPI). It is based on the architecture of the original [Google BERT base](https://github.com/google-research/bert) model and was pre-trained on a 97 GB corpus of Korean papers and patents (about 380 million sentences).

## Train dataset
|Type|Corpus size|Sentences|Avg. sentence length|
|--|--|--|--|
|Papers|15 GB|72,735,757|122.11|
|Patents|82 GB|316,239,927|120.91|
|Total|97 GB|388,975,684|121.13|
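
As a quick sanity check, the Total row follows from the per-type rows (a minimal sketch using only the figures in the table above):

```python
# Figures from the per-type rows of the training-data table.
papers = {"sentences": 72_735_757, "avg_len": 122.11}
patents = {"sentences": 316_239_927, "avg_len": 120.91}

total_sentences = papers["sentences"] + patents["sentences"]

# Sentence-weighted average length of the combined corpus.
weighted_avg = (
    papers["sentences"] * papers["avg_len"]
    + patents["sentences"] * patents["avg_len"]
) / total_sentences

print(total_sentences)         # 388975684, matching the Total row
print(round(weighted_avg, 2))  # 121.13, matching the Total row
```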

## Model architecture
- attention_probs_dropout_prob: 0.1
- directionality: "bidi"
- hidden_act: "gelu"
- hidden_dropout_prob: 0.1
- hidden_size: 768
- initializer_range: 0.02
- intermediate_size: 3072
- max_position_embeddings: 512
- num_attention_heads: 12
- num_hidden_layers: 12
- pooler_fc_size: 768
- pooler_num_attention_heads: 12
- pooler_num_fc_layers: 3
- pooler_size_per_head: 128
- pooler_type: "first_token_transform"
- type_vocab_size: 2
- vocab_size: 15330
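
The hyperparameters above are the fields of a standard BERT configuration file. A sketch of the equivalent JSON (the field names follow Google BERT's `bert_config.json` format; the values are taken verbatim from the list above):

```python
import json

# BERT-base-style configuration matching the hyperparameter list above,
# in the same field layout as Google BERT's bert_config.json.
korsci_bert_config = {
    "attention_probs_dropout_prob": 0.1,
    "directionality": "bidi",
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pooler_fc_size": 768,
    "pooler_num_attention_heads": 12,
    "pooler_num_fc_layers": 3,
    "pooler_size_per_head": 128,
    "pooler_type": "first_token_transform",
    "type_vocab_size": 2,
    "vocab_size": 15330,
}

print(json.dumps(korsci_bert_config, indent=2))
```

Note that, as in BERT base, 12 attention heads over a hidden size of 768 gives 64 dimensions per head.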

## Vocabulary
- Total: 15,330 tokens
- Includes the special tokens [PAD], [UNK], [CLS], [SEP], [MASK]
- File name: vocab_kisti.txt

## Language model
- Model file: model.ckpt-262500 (TensorFlow checkpoint)

## Pre-training
- Trained for 1,600,000 steps at sequence length 128, followed by 500,000 steps at sequence length 512
- Trained on the 380 million sentences of the combined paper + patent corpus (97 GB)
- Distributed training on 8 NVIDIA V100 32 GB GPUs with the [Horovod library](https://github.com/horovod/horovod)
- Used NVIDIA [Automatic Mixed Precision](https://developer.nvidia.com/automatic-mixed-precision)

## Downstream task evaluation
The model was evaluated by fine-tuning it on two classification tasks: the Korean science and technology standard classification and the Cooperative Patent Classification ([CPC](https://www.kipo.go.kr/kpo/HtmlApp?c=4021&catmenu=m06_07_01)) of patents. The results are as follows.
|Type|Classes|Train|Test|Metric|Train result|Test result|
|--|--|--|--|--|--|--|
|S&T standard classification|86|130,515|14,502|Accuracy|68.21|70.31|
|Patent CPC classification|144|390,540|16,315|Accuracy|86.87|76.25|

# Science and Technology Tokenizer (KorSci Tokenizer)

This tokenizer is another output of the same joint research project by KISTI and KIPI. It merges the [Mecab-ko Tokenizer](https://bitbucket.org/eunjeon/mecab-ko/src/master/), extended with a user dictionary of about 6 million nouns and compound nouns drawn from the pre-training corpus above, with the original [BERT WordPiece Tokenizer](https://github.com/google-research/bert).

## Model download
http://doi.org/10.23057/46

## Requirements

### Install eunjeon Mecab & add user dictionaries
Installation URL: https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/
- mecab-ko > 0.996-ko-0.9.2
- mecab-ko-dic > 2.1.1
- mecab-python > 0.996-ko-0.9.2

### Paper & patent user dictionaries
- Paper user dictionary: pap_all_mecab_dic.csv (1,001,328 words)
- Patent user dictionary: pat_all_mecab_dic.csv (5,000,000 words)

### Install konlpy
`pip install konlpy`
- konlpy > 0.5.2

## Usage

```python
import tokenization_kisti as tokenization

vocab_file = "./vocab_kisti.txt"

tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file,
    do_lower_case=False,
    tokenizer_type="Mecab"
)

example = "본 고안은 주로 일회용 합성세제액을 집어넣어 밀봉하는 세제액포의 내부를 원호상으로 열중착하되 세제액이 배출되는 절단부 쪽으로 내벽을 협소하게 형성하여서 내부에 들어있는 세제액을 잘짜질 수 있도록 하는 합성세제 액포에 관한 것이다."
tokens = tokenizer.tokenize(example)
encoded_tokens = tokenizer.convert_tokens_to_ids(tokens)
decoded_tokens = tokenizer.convert_ids_to_tokens(encoded_tokens)

print("Input example ===>", example)
print("Tokenized example ===>", tokens)
print("Converted example to IDs ===>", encoded_tokens)
print("Converted IDs to example ===>", decoded_tokens)
```

```
============ Result ================
Input example ===> 본 고안은 주로 일회용 합성세제액을 집어넣어 밀봉하는 세제액포의 내부를 원호상으로 열중착하되 세제액이 배출되는 절단부 쪽으로 내벽을 협소하게 형성하여서 내부에 들어있는 세제액을 잘짜질 수 있도록 하는 합성세제 액포에 관한 것이다.
Tokenized example ===> ['본', '고안', '은', '주로', '일회용', '합성', '##세', '##제', '##액', '을', '집', '##어', '##넣', '어', '밀봉', '하', '는', '세제', '##액', '##포', '의', '내부', '를', '원호', '상', '으로', '열', '##중', '착', '##하', '되', '세제', '##액', '이', '배출', '되', '는', '절단부', '쪽', '으로', '내벽', '을', '협', '##소', '하', '게', '형성', '하', '여서', '내부', '에', '들', '어', '있', '는', '세제', '##액', '을', '잘', '짜', '질', '수', '있', '도록', '하', '는', '합성', '##세', '##제', '액', '##포', '에', '관한', '것', '이', '다', '.']
Converted example to IDs ===> [59, 619, 30, 2336, 8268, 819, 14100, 13986, 14198, 15, 732, 13994, 14615, 39, 1964, 12, 11, 6174, 14198, 14061, 9, 366, 16, 7267, 18, 32, 307, 14072, 891, 13967, 27, 6174, 14198, 14, 698, 27, 11, 12920, 1972, 32, 4482, 15, 2228, 14053, 12, 65, 117, 12, 4477, 366, 10, 56, 39, 26, 11, 6174, 14198, 15, 1637, 13709, 398, 25, 26, 140, 12, 11, 819, 14100, 13986, 377, 14061, 10, 487, 55, 14, 17, 13]
Converted IDs to example ===> ['본', '고안', '은', '주로', '일회용', '합성', '##세', '##제', '##액', '을', '집', '##어', '##넣', '어', '밀봉', '하', '는', '세제', '##액', '##포', '의', '내부', '를', '원호', '상', '으로', '열', '##중', '착', '##하', '되', '세제', '##액', '이', '배출', '되', '는', '절단부', '쪽', '으로', '내벽', '을', '협', '##소', '하', '게', '형성', '하', '여서', '내부', '에', '들', '어', '있', '는', '세제', '##액', '을', '잘', '짜', '질', '수', '있', '도록', '하', '는', '합성', '##세', '##제', '액', '##포', '에', '관한', '것', '이', '다', '.']
```

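The WordPiece half of the pipeline, applied after Mecab's morpheme segmentation, can be illustrated with a minimal greedy longest-match-first sketch. The toy vocabulary below contains only the pieces needed for one word from the example output and is not the real vocab_kisti.txt:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation, as used by
    BERT tokenizers. Continuation pieces are prefixed with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Try the longest remaining substring first, shrinking until a
        # vocabulary entry matches.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matches: the whole word becomes [UNK]
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary: just the pieces seen in the example output above.
toy_vocab = {"합성", "##세", "##제", "##액"}
print(wordpiece("합성세제액", toy_vocab))  # ['합성', '##세', '##제', '##액']
```

This reproduces the segmentation of 합성세제액 shown in the tokenized example above.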
### Fine-tuning with KorSci-Bert
- Follow the fine-tuning procedure of [Google BERT](https://github.com/google-research/bert)
- Sentence (and sentence-pair) classification tasks: use the "run_classifier.py" script
- MRC (Machine Reading Comprehension) tasks: use the "run_squad.py" script
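
For reference, a classification fine-tuning run might be launched roughly as below. The flags follow Google BERT's run_classifier.py; the task name, data paths, output directory, and the `bert_config.json` file name are placeholders (a Korean classification task would also need its own DataProcessor registered in run_classifier.py):

```shell
# Placeholder invocation; adapt task name, paths, and hyperparameters.
python run_classifier.py \
  --task_name=my_task \
  --do_train=true \
  --do_eval=true \
  --data_dir=$DATA_DIR \
  --vocab_file=vocab_kisti.txt \
  --bert_config_file=bert_config.json \
  --init_checkpoint=model.ckpt-262500 \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=./korsci_output/
```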