Pre-trained BERT Model for the Science and Technology Domain (KorSci BERT)

The KorSci BERT language model is one of the outcomes of a joint research project between the Korea Institute of Science and Technology Information (KISTI) and the Korea Institute of Patent Information (KIPI). Based on the architecture of the original Google BERT base model, it was pre-trained on a total of 97 GB of Korean paper and patent corpora (about 380 million sentences).

Training dataset

Type      Corpus   Sentences     Avg. sentence length
Papers    15 GB    72,735,757    122.11
Patents   82 GB    316,239,927   120.91
Total     97 GB    388,975,684   121.13

Model architecture

  • attention_probs_dropout_prob:0.1
  • directionality:"bidi"
  • hidden_act:"gelu"
  • hidden_dropout_prob:0.1
  • hidden_size:768
  • initializer_range:0.02
  • intermediate_size:3072
  • max_position_embeddings:512
  • num_attention_heads:12
  • num_hidden_layers:12
  • pooler_fc_size:768
  • pooler_num_attention_heads:12
  • pooler_num_fc_layers:3
  • pooler_size_per_head:128
  • pooler_type:"first_token_transform"
  • type_vocab_size:2
  • vocab_size:15330
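
These hyperparameters map directly onto a Hugging Face transformers BertConfig, which can be convenient for PyTorch users. The following is a minimal sketch under that assumption; transformers is not part of the original release, and the TF-specific pooler_* fields are omitted because the Hugging Face BertModel does not use them:

from transformers import BertConfig

# Sketch: a BertConfig mirroring the published KorSci BERT hyperparameters.
korsci_config = BertConfig(
    vocab_size=15330,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
)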

Vocabulary

  • Total of 15,330 tokens
  • Includes the special tokens [PAD], [UNK], [CLS], [SEP], [MASK]
  • File name : vocab_kisti.txt
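
A quick way to sanity-check the downloaded vocabulary file (a minimal sketch; it only assumes vocab_kisti.txt sits in the working directory):

# Count the entries and confirm the special tokens are present.
with open("vocab_kisti.txt", encoding="utf-8") as f:
    vocab = [line.strip() for line in f]
print(len(vocab))  # expected: 15330
for token in ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"):
    assert token in vocab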

Language model

  • Model file : model.ckpt-262500 (TensorFlow checkpoint file)
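
PyTorch users may want to convert the TensorFlow checkpoint first. Below is a hedged sketch using the load_tf_weights_in_bert utility from Hugging Face transformers; the bert_config.json file name assumes the release ships a config with the hyperparameters listed above, and both transformers and TensorFlow must be installed:

from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Sketch: load the released TF checkpoint into a PyTorch model and re-save it.
config = BertConfig.from_json_file("bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "model.ckpt-262500")
model.save_pretrained("./korsci-bert-pytorch")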

Pre-training

  • Trained for 1,600,000 steps at sequence length 128, followed by 500,000 steps at sequence length 512
  • Trained on the 380 million sentences of the combined paper + patent corpus (97 GB)
  • Distributed training on 8 NVIDIA V100 32 GB GPUs with the Horovod library (see the sketch after this list)
  • Used NVIDIA Automatic Mixed Precision (AMP)
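
The distributed setup can be sketched roughly as follows. This is a TF1-style illustration of Horovod plus the AMP graph rewrite, not the project's actual training script; the optimizer and learning rate are placeholders:

import tensorflow as tf
import horovod.tensorflow as hvd

# Sketch: one process per GPU, gradients averaged across workers.
hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers, enable AMP via the
# graph rewrite, then wrap the optimizer so Horovod all-reduces gradients.
opt = tf.train.AdamOptimizer(1e-4 * hvd.size())
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
opt = hvd.DistributedOptimizer(opt)

# Broadcast initial weights from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]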

Downstream task evaluation

๋ณธ ์–ธ์–ด๋ชจ๋ธ์˜ ์„ฑ๋Šฅํ‰๊ฐ€๋Š” ๊ณผํ•™๊ธฐ์ˆ ํ‘œ์ค€๋ถ„๋ฅ˜ ๋ฐ ํŠนํ—ˆ ์„ ์ง„ํŠนํ—ˆ๋ถ„๋ฅ˜(CPC) 2๊ฐ€์ง€์˜ ํƒœ์Šคํฌ๋ฅผ ํŒŒ์ธํŠœ๋‹ํ•˜์—ฌ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ, ๊ทธ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

Type                                           Classes   Train     Test     Metric     Train result   Test result
Science & technology standard classification   86        130,515   14,502   Accuracy   68.21          70.31
Patent CPC classification                      144       390,540   16,315   Accuracy   86.87          76.25

๊ณผํ•™๊ธฐ์ˆ ๋ถ„์•ผ ํ† ํฌ๋‚˜์ด์ € (KorSci Tokenizer)

๋ณธ ํ† ํฌ๋‚˜์ด์ €๋Š” ํ•œ๊ตญ๊ณผํ•™๊ธฐ์ˆ ์ •๋ณด์—ฐ๊ตฌ์›๊ณผ ํ•œ๊ตญํŠนํ—ˆ์ •๋ณด์›์ด ๊ณต๋™์œผ๋กœ ์—ฐ๊ตฌํ•œ ๊ณผ์ œ์˜ ๊ฒฐ๊ณผ๋ฌผ ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ , ์œ„ ์‚ฌ์ „ํ•™์Šต ๋ชจ๋ธ์—์„œ ์‚ฌ์šฉ๋œ ์ฝ”ํผ์Šค๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ช…์‚ฌ ๋ฐ ๋ณตํ•ฉ๋ช…์‚ฌ ์•ฝ 600๋งŒ๊ฐœ์˜ ์‚ฌ์šฉ์ž์‚ฌ์ „์ด ์ถ”๊ฐ€๋œ Mecab-ko Tokenizer์™€ ๊ธฐ์กด BERT WordPiece Tokenizer๊ฐ€ ๋ณ‘ํ•ฉ๋˜์–ด์ง„ ํ† ํฌ๋‚˜์ด์ €์ด๋‹ค.

๋ชจ๋ธ ๋‹ค์šด๋กœ๋“œ

http://doi.org/10.23057/46

Requirements

Install Eunjeon (์€์ „ํ•œ๋‹ข) Mecab & add the user dictionaries

Installation URL: https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/
mecab-ko > 0.996-ko-0.9.2
mecab-ko-dic > 2.1.1
mecab-python > 0.996-ko-0.9.2

Paper & patent user dictionaries

  • Paper user dictionary : pap_all_mecab_dic.csv (1,001,328 words)
  • Patent user dictionary : pat_all_mecab_dic.csv (5,000,000 words)
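
One way to register these CSV files is the user-dictionary script bundled with mecab-ko-dic; a hedged sketch (the paths follow the mecab-ko-dic repository layout and may differ on your system):

cp pap_all_mecab_dic.csv pat_all_mecab_dic.csv mecab-ko-dic/user-dic/
cd mecab-ko-dic
./tools/add-userdic.sh   # recompile the dictionary with the user entries
make install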

Install konlpy

pip install konlpy
konlpy > 0.5.2
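
To verify that Mecab is visible from Python, a quick smoke test via konlpy (the sample phrase is illustrative):

from konlpy.tag import Mecab

# If this prints a list of morphemes, Mecab and its dictionary are wired up.
mecab = Mecab()
print(mecab.morphs("ํ•ฉ์„ฑ์„ธ์ œ์•ก์„ ๋ฐ€๋ด‰ํ•˜๋Š” ์„ธ์ œ์•กํฌ"))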

Usage

import tokenization_kisti as tokenization

vocab_file = "./vocab_kisti.txt"

# FullTokenizer combines Mecab-ko morphological analysis with BERT
# WordPiece, as described in the tokenizer section above.
tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file,
    do_lower_case=False,
    tokenizer_type="Mecab"
)

example = "๋ณธ ๊ณ ์•ˆ์€ ์ฃผ๋กœ ์ผํšŒ์šฉ ํ•ฉ์„ฑ์„ธ์ œ์•ก์„ ์ง‘์–ด๋„ฃ์–ด ๋ฐ€๋ด‰ํ•˜๋Š” ์„ธ์ œ์•กํฌ์˜ ๋‚ด๋ถ€๋ฅผ ์›ํ˜ธ์ƒ์œผ๋กœ ์—ด์ค‘์ฐฉํ•˜๋˜ ์„ธ์ œ์•ก์ด ๋ฐฐ์ถœ๋˜๋Š” ์ ˆ๋‹จ๋ถ€ ์ชฝ์œผ๋กœ ๋‚ด๋ฒฝ์„ ํ˜‘์†Œํ•˜๊ฒŒ ํ˜•์„ฑํ•˜์—ฌ์„œ ๋‚ด๋ถ€์— ๋“ค์–ด์žˆ๋Š” ์„ธ์ œ์•ก์„ ์ž˜์งœ์งˆ ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ํ•ฉ์„ฑ์„ธ์ œ ์•กํฌ์— ๊ด€ํ•œ ๊ฒƒ์ด๋‹ค."
tokens = tokenizer.tokenize(example)
encoded_tokens = tokenizer.convert_tokens_to_ids(tokens)
decoded_tokens = tokenizer.convert_ids_to_tokens(encoded_tokens)

print("Input example ===>", example)
print("Tokenized example ===>", tokens)
print("Converted example to IDs ===>", encoded_tokens)
print("Converted IDs to example ===>", decoded_tokens)

============ Result ================
Input example ===> ๋ณธ ๊ณ ์•ˆ์€ ์ฃผ๋กœ ์ผํšŒ์šฉ ํ•ฉ์„ฑ์„ธ์ œ์•ก์„ ์ง‘์–ด๋„ฃ์–ด ๋ฐ€๋ด‰ํ•˜๋Š” ์„ธ์ œ์•กํฌ์˜ ๋‚ด๋ถ€๋ฅผ ์›ํ˜ธ์ƒ์œผ๋กœ ์—ด์ค‘์ฐฉํ•˜๋˜ ์„ธ์ œ์•ก์ด ๋ฐฐ์ถœ๋˜๋Š” ์ ˆ๋‹จ๋ถ€ ์ชฝ์œผ๋กœ ๋‚ด๋ฒฝ์„ ํ˜‘์†Œํ•˜๊ฒŒ ํ˜•์„ฑํ•˜์—ฌ์„œ ๋‚ด๋ถ€์— ๋“ค์–ด์žˆ๋Š” ์„ธ์ œ์•ก์„ ์ž˜์งœ์งˆ ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ํ•ฉ์„ฑ์„ธ์ œ ์•กํฌ์— ๊ด€ํ•œ ๊ฒƒ์ด๋‹ค.
Tokenized example ===> ['๋ณธ', '๊ณ ์•ˆ', '์€', '์ฃผ๋กœ', '์ผํšŒ์šฉ', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '##์•ก', '์„', '์ง‘', '##์–ด', '##๋„ฃ', '์–ด', '๋ฐ€๋ด‰', 'ํ•˜', '๋Š”', '์„ธ์ œ', '##์•ก', '##ํฌ', '์˜', '๋‚ด๋ถ€', '๋ฅผ', '์›ํ˜ธ', '์ƒ', '์œผ๋กœ', '์—ด', '##์ค‘', '์ฐฉ', '##ํ•˜', '๋˜', '์„ธ์ œ', '##์•ก', '์ด', '๋ฐฐ์ถœ', '๋˜', '๋Š”', '์ ˆ๋‹จ๋ถ€', '์ชฝ', '์œผ๋กœ', '๋‚ด๋ฒฝ', '์„', 'ํ˜‘', '##์†Œ', 'ํ•˜', '๊ฒŒ', 'ํ˜•์„ฑ', 'ํ•˜', '์—ฌ์„œ', '๋‚ด๋ถ€', '์—', '๋“ค', '์–ด', '์žˆ', '๋Š”', '์„ธ์ œ', '##์•ก', '์„', '์ž˜', '์งœ', '์งˆ', '์ˆ˜', '์žˆ', '๋„๋ก', 'ํ•˜', '๋Š”', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '์•ก', '##ํฌ', '์—', '๊ด€ํ•œ', '๊ฒƒ', '์ด', '๋‹ค', '.']
Converted example to IDs ===> [59, 619, 30, 2336, 8268, 819, 14100, 13986, 14198, 15, 732, 13994, 14615, 39, 1964, 12, 11, 6174, 14198, 14061, 9, 366, 16, 7267, 18, 32, 307, 14072, 891, 13967, 27, 6174, 14198, 14, 698, 27, 11, 12920, 1972, 32, 4482, 15, 2228, 14053, 12, 65, 117, 12, 4477, 366, 10, 56, 39, 26, 11, 6174, 14198, 15, 1637, 13709, 398, 25, 26, 140, 12, 11, 819, 14100, 13986, 377, 14061, 10, 487, 55, 14, 17, 13]
Converted IDs to example ===> ['๋ณธ', '๊ณ ์•ˆ', '์€', '์ฃผ๋กœ', '์ผํšŒ์šฉ', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '##์•ก', '์„', '์ง‘', '##์–ด', '##๋„ฃ', '์–ด', '๋ฐ€๋ด‰', 'ํ•˜', '๋Š”', '์„ธ์ œ', '##์•ก', '##ํฌ', '์˜', '๋‚ด๋ถ€', '๋ฅผ', '์›ํ˜ธ', '์ƒ', '์œผ๋กœ', '์—ด', '##์ค‘', '์ฐฉ', '##ํ•˜', '๋˜', '์„ธ์ œ', '##์•ก', '์ด', '๋ฐฐ์ถœ', '๋˜', '๋Š”', '์ ˆ๋‹จ๋ถ€', '์ชฝ', '์œผ๋กœ', '๋‚ด๋ฒฝ', '์„', 'ํ˜‘', '##์†Œ', 'ํ•˜', '๊ฒŒ', 'ํ˜•์„ฑ', 'ํ•˜', '์—ฌ์„œ', '๋‚ด๋ถ€', '์—', '๋“ค', '์–ด', '์žˆ', '๋Š”', '์„ธ์ œ', '##์•ก', '์„', '์ž˜', '์งœ', '์งˆ', '์ˆ˜', '์žˆ', '๋„๋ก', 'ํ•˜', '๋Š”', 'ํ•ฉ์„ฑ', '##์„ธ', '##์ œ', '์•ก', '##ํฌ', '์—', '๊ด€ํ•œ', '๊ฒƒ', '์ด', '๋‹ค', '.']

Fine-tuning with KorSci-Bert

  • Google Bert์˜ Fine-tuning ๋ฐฉ๋ฒ• ์ฐธ๊ณ 
  • Sentence (and sentence-pair) classification tasks: "run_classifier.py" ์ฝ”๋“œ ํ™œ์šฉ
  • MRC(Machine Reading Comprehension) tasks: "run_squad.py" ์ฝ”๋“œ ํ™œ์šฉ
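
For illustration, a classification fine-tuning run might be launched as follows. This is a hedged sketch: the task name, data directory, and hyperparameters are placeholders, and bert_config.json is assumed to hold the hyperparameters listed above:

python run_classifier.py \
  --task_name=cola \
  --do_train=true \
  --do_eval=true \
  --data_dir=./my_task_data \
  --vocab_file=./vocab_kisti.txt \
  --bert_config_file=./bert_config.json \
  --init_checkpoint=./model.ckpt-262500 \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=./korsci_finetune_output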