---
license: cc-by-nc-3.0
---
# Pre-trained BERT Model for Science and Technology (KorSci BERT)
KorSci BERT is one of the outcomes of a project jointly carried out by the Korea Institute of Science and Technology Information (KISTI) and the Korea Institute of Patent Information (KIPI). It follows the architecture of the original [Google BERT base](https://github.com/google-research/bert) model and was pre-trained on a Korean paper and patent corpus totaling 97 GB (about 380 million sentences).
## Train dataset
|Type|Corpus size|Sentences|Avg. sentence length|
|--|--|--|--|
|Papers|15 GB|72,735,757|122.11|
|Patents|82 GB|316,239,927|120.91|
|Total|97 GB|388,975,684|121.13|
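The total row can be cross-checked from the two corpus rows. The sketch below uses only the figures from the table; treating the combined average as a sentence-weighted mean is an assumption about how it was derived, not something the source states.

```python
# Figures copied from the table above: (sentence count, average sentence length).
corpora = {
    "papers": (72_735_757, 122.11),
    "patents": (316_239_927, 120.91),
}

total_sentences = sum(count for count, _ in corpora.values())

# Assumption: the combined average is the sentence-weighted mean of the two rows.
weighted_avg = sum(count * avg for count, avg in corpora.values()) / total_sentences

print(f"{total_sentences:,}")   # 388,975,684
print(round(weighted_avg, 2))   # 121.13
```

Both values agree with the total row of the table.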
## Model architecture
- attention_probs_dropout_prob:0.1
- directionality:"bidi"
- hidden_act:"gelu"
- hidden_dropout_prob:0.1
- hidden_size:768
- initializer_range:0.02
- intermediate_size:3072
- max_position_embeddings:512
- num_attention_heads:12
- num_hidden_layers:12
- pooler_fc_size:768
- pooler_num_attention_heads:12
- pooler_num_fc_layers:3
- pooler_size_per_head:128
- pooler_type:"first_token_transform"
- type_vocab_size:2
- vocab_size:15330
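The values above have the shape of a Google BERT configuration file. As a minimal sketch, they can be collected into a dictionary and serialized to JSON; the variable names are illustrative, and saving the result under a name such as `bert_config.json` is an assumption, since the source does not name a configuration file.

```python
import json

# Values copied from the list above.
korsci_bert_config = {
    "attention_probs_dropout_prob": 0.1,
    "directionality": "bidi",
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pooler_fc_size": 768,
    "pooler_num_attention_heads": 12,
    "pooler_num_fc_layers": 3,
    "pooler_size_per_head": 128,
    "pooler_type": "first_token_transform",
    "type_vocab_size": 2,
    "vocab_size": 15330,
}

# Each attention head works on hidden_size / num_attention_heads dimensions.
head_size = korsci_bert_config["hidden_size"] // korsci_bert_config["num_attention_heads"]
print(head_size)  # 64

# JSON text that could be written to a config file (file name is up to the user).
config_json = json.dumps(korsci_bert_config, indent=2)
```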
## Vocabulary
- 15,330 tokens in total
- Includes the special tokens [PAD], [UNK], [CLS], [SEP], and [MASK]
- File name: vocab_kisti.txt
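A quick way to inspect a vocabulary file is sketched below. It assumes `vocab_kisti.txt` follows the usual Google BERT layout (one token per line, ID equal to the zero-based line index), which the source does not state explicitly. The helper names and the toy file are hypothetical, and the token order in the toy file is arbitrary.

```python
import os
import tempfile

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def load_vocab(path):
    """Map each token to its ID, taken as its zero-based line index."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for index, line in enumerate(f):
            vocab[line.rstrip("\n")] = index
    return vocab


def missing_special_tokens(vocab):
    """Return the special tokens that are absent from the vocabulary."""
    return [token for token in SPECIAL_TOKENS if token not in vocab]


# Toy stand-in so the sketch runs without the real vocab_kisti.txt.
toy_tokens = SPECIAL_TOKENS + ["합성", "##세", "##제", "##액"]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_vocab.txt")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write("\n".join(toy_tokens) + "\n")
    toy_vocab = load_vocab(toy_path)

print(len(toy_vocab))                     # 9
print(missing_special_tokens(toy_vocab))  # []
```

Pointed at the real file, `len(load_vocab("./vocab_kisti.txt"))` should match the 15,330 tokens listed above if the layout assumption holds.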
## Language model
- Model file: model.ckpt-262500 (TensorFlow checkpoint file)
## Pre-training
- Trained for 1,600,000 steps at sequence length 128 plus 500,000 steps at sequence length 512
- Trained on about 380 million sentences from the paper and patent corpus (97 GB)
- Distributed training on 8 NVIDIA V100 32 GB GPUs with the [Horovod](https://github.com/horovod/horovod) library
- Used NVIDIA [Automatic Mixed Precision](https://developer.nvidia.com/automatic-mixed-precision)
## Downstream task evaluation
The language model was evaluated by fine-tuning it on two tasks: science and technology standard classification, and patent classification under the Cooperative Patent Classification ([CPC](https://www.kipo.go.kr/kpo/HtmlApp?c=4021&catmenu=m06_07_01)). The results are as follows.

|Type|Classes|Train|Test|Metric|Train result|Test result|
|--|--|--|--|--|--|--|
|Science and technology standard classification|86|130,515|14,502|Accuracy|68.21|70.31|
|Patent CPC classification|144|390,540|16,315|Accuracy|86.87|76.25|
# Tokenizer for Science and Technology (KorSci Tokenizer)
This tokenizer is one of the outcomes of a project jointly carried out by the Korea Institute of Science and Technology Information (KISTI) and the Korea Institute of Patent Information (KIPI). It combines the [Mecab-ko Tokenizer](https://bitbucket.org/eunjeon/mecab-ko/src/master/), extended with a user dictionary of about 6 million nouns and compound nouns built from the corpus used for the pre-trained model above, with the original [BERT WordPiece Tokenizer](https://github.com/google-research/bert).
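The two-stage idea (morphological analysis first, then WordPiece on each morpheme) can be sketched as follows. This is not the code in `tokenization_kisti`: the Mecab stage is replaced by a hard-coded, hypothetical morpheme split, the vocabulary is a toy set, and the function names are illustrative. The WordPiece step is the standard greedy longest-match-first procedure.

```python
def wordpiece(morpheme, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece split of one morpheme."""
    pieces = []
    start = 0
    while start < len(morpheme):
        end = len(morpheme)
        piece = None
        while start < end:
            candidate = morpheme[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched: the whole unit is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(morphemes, vocab):
    """Stage 2: apply WordPiece to the morphemes produced by stage 1."""
    tokens = []
    for morpheme in morphemes:
        tokens.extend(wordpiece(morpheme, vocab))
    return tokens


# Stage 1 is stubbed: a hypothetical morpheme split of "합성세제액을".
morphemes = ["합성세제액", "을"]
# Toy vocabulary, not the real vocab_kisti.txt.
toy_vocab = {"합성", "##세", "##제", "##액", "을"}

print(tokenize(morphemes, toy_vocab))
# ['합성', '##세', '##제', '##액', '을']
```

Splitting on morpheme boundaries first keeps particles such as "을" as separate tokens, while WordPiece handles compounds that are not in the vocabulary as a whole.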
## Model download
http://doi.org/10.23057/46
## Requirements
### Install Eunjeon Hannip Mecab and add the user dictionaries
Installation URL: https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/
- mecab-ko > 0.996-ko-0.9.2
- mecab-ko-dic > 2.1.1
- mecab-python > 0.996-ko-0.9.2
### Paper and patent user dictionaries
- Paper user dictionary: pap_all_mecab_dic.csv (1,001,328 words)
- Patent user dictionary: pat_all_mecab_dic.csv (5,000,000 words)
### Install konlpy
- `pip install konlpy`
- konlpy > 0.5.2
## Usage
    import tokenization_kisti as tokenization

    # Path to the KorSci vocabulary file
    vocab_file = "./vocab_kisti.txt"

    # Mecab morphological analysis combined with WordPiece
    tokenizer = tokenization.FullTokenizer(
        vocab_file=vocab_file,
        do_lower_case=False,
        tokenizer_type="Mecab"
    )

    example = "본 고안은 주로 일회용 합성세제액을 집어넣어 밀봉하는 세제액포의 내부를 원호상으로 열중착하되 세제액이 배출되는 절단부 쪽으로 내벽을 협소하게 형성하여서 내부에 들어있는 세제액을 잘짜질 수 있도록 하는 합성세제 액포에 관한 것이다."

    # Tokenize, map tokens to vocabulary IDs, and map the IDs back to tokens
    tokens = tokenizer.tokenize(example)
    encoded_tokens = tokenizer.convert_tokens_to_ids(tokens)
    decoded_tokens = tokenizer.convert_ids_to_tokens(encoded_tokens)

    print("Input example ===>", example)
    print("Tokenized example ===>", tokens)
    print("Converted example to IDs ===>", encoded_tokens)
    print("Converted IDs to example ===>", decoded_tokens)

    ============ Result ================
    Input example ===> 본 고안은 주로 일회용 합성세제액을 집어넣어 밀봉하는 세제액포의 내부를 원호상으로 열중착하되 세제액이 배출되는 절단부 쪽으로 내벽을 협소하게 형성하여서 내부에 들어있는 세제액을 잘짜질 수 있도록 하는 합성세제 액포에 관한 것이다.
    Tokenized example ===> ['본', '고안', '은', '주로', '일회용', '합성', '##세', '##제', '##액', '을', '집', '##어', '##넣', '어', '밀봉', '하', '는', '세제', '##액', '##포', '의', '내부', '를', '원호', '상', '으로', '열', '##중', '착', '##하', '되', '세제', '##액', '이', '배출', '되', '는', '절단부', '쪽', '으로', '내벽', '을', '협', '##소', '하', '게', '형성', '하', '여서', '내부', '에', '들', '어', '있', '는', '세제', '##액', '을', '잘', '짜', '질', '수', '있', '도록', '하', '는', '합성', '##세', '##제', '액', '##포', '에', '관한', '것', '이', '다', '.']
    Converted example to IDs ===> [59, 619, 30, 2336, 8268, 819, 14100, 13986, 14198, 15, 732, 13994, 14615, 39, 1964, 12, 11, 6174, 14198, 14061, 9, 366, 16, 7267, 18, 32, 307, 14072, 891, 13967, 27, 6174, 14198, 14, 698, 27, 11, 12920, 1972, 32, 4482, 15, 2228, 14053, 12, 65, 117, 12, 4477, 366, 10, 56, 39, 26, 11, 6174, 14198, 15, 1637, 13709, 398, 25, 26, 140, 12, 11, 819, 14100, 13986, 377, 14061, 10, 487, 55, 14, 17, 13]
    Converted IDs to example ===> ['본', '고안', '은', '주로', '일회용', '합성', '##세', '##제', '##액', '을', '집', '##어', '##넣', '어', '밀봉', '하', '는', '세제', '##액', '##포', '의', '내부', '를', '원호', '상', '으로', '열', '##중', '착', '##하', '되', '세제', '##액', '이', '배출', '되', '는', '절단부', '쪽', '으로', '내벽', '을', '협', '##소', '하', '게', '형성', '하', '여서', '내부', '에', '들', '어', '있', '는', '세제', '##액', '을', '잘', '짜', '질', '수', '있', '도록', '하', '는', '합성', '##세', '##제', '액', '##포', '에', '관한', '것', '이', '다', '.']
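Before tokens like these are fed to a BERT model, they are usually wrapped in `[CLS]` and `[SEP]` and padded to a fixed length. The sketch below shows that step for a single sentence at the token level. The function name is hypothetical, and padding with the `[PAD]` token before ID conversion is one possible arrangement rather than the procedure used by the authors.

```python
def build_bert_inputs(tokens, max_seq_length):
    """Wrap one tokenized sentence as a single-segment BERT input."""
    # Leave room for [CLS] and [SEP].
    tokens = tokens[: max_seq_length - 2]
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]

    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(wrapped)
    padding = max_seq_length - len(wrapped)
    wrapped += ["[PAD]"] * padding
    input_mask += [0] * padding

    # A single sentence uses segment 0 throughout.
    segment_ids = [0] * max_seq_length
    return wrapped, input_mask, segment_ids


# First five tokens of the tokenized example above.
tokens = ["본", "고안", "은", "주로", "일회용"]
wrapped, input_mask, segment_ids = build_bert_inputs(tokens, max_seq_length=10)

print(wrapped)
# ['[CLS]', '본', '고안', '은', '주로', '일회용', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(input_mask)
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The wrapped token list can then be passed to `tokenizer.convert_tokens_to_ids` to obtain the input IDs.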
### Fine-tuning with KorSci BERT
- See the fine-tuning instructions for [Google BERT](https://github.com/google-research/bert)
- Sentence (and sentence-pair) classification tasks: use the `run_classifier.py` script
- MRC (Machine Reading Comprehension) tasks: use the `run_squad.py` script
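As a rough illustration, a classification run could be launched with a command assembled as below. The flag names are those defined by `run_classifier.py` in the Google BERT repository. Every path, the task name, the hyperparameter values, and the `bert_config.json` file name are placeholders and not values from this README. The sketch also assumes the script has been adapted to use `tokenization_kisti` and a custom data processor for the task.

```python
import shlex

# Flags defined by run_classifier.py in google-research/bert.
# All values are placeholders except the vocab and checkpoint file names,
# which are the ones listed in this README.
args = {
    "task_name": "my_task",                    # hypothetical; needs a matching DataProcessor
    "do_train": "true",
    "do_eval": "true",
    "do_lower_case": "false",
    "data_dir": "./data",                      # placeholder
    "vocab_file": "./vocab_kisti.txt",
    "bert_config_file": "./bert_config.json",  # assumed file name
    "init_checkpoint": "./model.ckpt-262500",
    "max_seq_length": 128,
    "train_batch_size": 32,
    "learning_rate": 2e-5,
    "num_train_epochs": 3.0,
    "output_dir": "./output",                  # placeholder
}

command = ["python", "run_classifier.py"] + [f"--{key}={value}" for key, value in args.items()]

# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(command))
```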