---
language:
- multilingual
- en
- ko
- ar
- bg
- de
- el
- es
- fr
- hi
- ru
- sw
- th
- tr
- ur
- vi
- zh
tags:
- deberta
- deberta-v3
- mdeberta
- korean
- pretraining
license: mit
---
# mDeBERTa-v3-base-kor-further
> 💡 This project was carried out by KPMG Lighthouse Korea.
> At KPMG Lighthouse Korea, we build NLP/Vision AI models with edge technology to solve a variety of problems in the financial domain.
> https://kpmgkr.notion.site/
## What is DeBERTa?
- [DeBERTa](https://arxiv.org/abs/2006.03654) applies `Disentangled Attention` + `Enhanced Mask Decoder` to learn the positional information of words effectively. Unlike the absolute position embeddings used in BERT and RoBERTa, DeBERTa represents the relative position of each word as learnable vectors during training. As a result, it shows superior performance compared to BERT and RoBERTa. (A minimal sketch of the disentangled attention score appears after this list.)
- [DeBERTa-v3](https://arxiv.org/abs/2111.09543) improves training efficiency over earlier versions by replacing the MLM (Masked Language Model) objective with an ELECTRA-style pre-training method based on RTD (Replaced Token Detection), and by applying Gradient-Disentangled Embedding Sharing.
- To learn from rich Korean data on top of the DeBERTa architecture, `mDeBERTa-v3-base-kor-further` is a language model obtained by **further pre-training** Microsoft's `mDeBERTa-v3-base` on roughly 40GB of Korean data.
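The following is a minimal, self-contained sketch of the disentangled attention idea described above, using toy dimensions and random projections rather than the model's actual weights or implementation: the attention score between two tokens is decomposed into content-to-content, content-to-position, and position-to-content terms, where clipped relative distances index a learnable embedding table.
```python
import torch

# Toy sizes for illustration only (assumptions, not the real model's dimensions)
seq_len, d = 4, 8
max_rel = 3  # relative distances are clipped to [-3, 3]

content = torch.randn(seq_len, d)               # token content states H
rel_emb = torch.randn(2 * max_rel + 1, d)       # learnable relative-position embeddings P

Wq_c, Wk_c = torch.randn(d, d), torch.randn(d, d)   # content projections
Wq_r, Wk_r = torch.randn(d, d), torch.randn(d, d)   # position projections

Qc, Kc = content @ Wq_c, content @ Wk_c         # content queries / keys
Qr, Kr = rel_emb @ Wq_r, rel_emb @ Wk_r         # position queries / keys

# Clipped relative distance delta(i, j), shifted to index the embedding table
idx = torch.arange(seq_len)
delta = (idx[:, None] - idx[None, :]).clamp(-max_rel, max_rel) + max_rel

c2c = Qc @ Kc.T                                 # content-to-content
c2p = torch.gather(Qc @ Kr.T, 1, delta)         # content-to-position
p2c = torch.gather(Kc @ Qr.T, 1, delta.T).T     # position-to-content

scores = (c2c + c2p + p2c) / (3 * d) ** 0.5     # scaled by sqrt(3d) as in the paper
attn = torch.softmax(scores, dim=-1)
print(attn.shape)  # torch.Size([4, 4])
```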
## How to Use
- Requirements
```
pip install transformers
pip install sentencepiece
```
- Huggingface Hub
```python
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("lighthouse/mdeberta-v3-base-kor-further") # DebertaV2Model
tokenizer = AutoTokenizer.from_pretrained("lighthouse/mdeberta-v3-base-kor-further") # DebertaV2Tokenizer (SentencePiece)
```
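Below is a short, hedged usage sketch (the example sentence and output handling are illustrative and not part of the original card) showing the model as a feature extractor over Korean text:
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("lighthouse/mdeberta-v3-base-kor-further")
tokenizer = AutoTokenizer.from_pretrained("lighthouse/mdeberta-v3-base-kor-further")

# Illustrative input sentence
inputs = tokenizer("안녕하세요, mDeBERTa 모델입니다.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size = 768)
print(outputs.last_hidden_state.shape)
```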
## Pre-trained Models
- The model architecture is identical to that of `mdeberta-v3-base` released by Microsoft.
| | Vocabulary(K) | Backbone Parameters(M) | Hidden Size | Layers | Note |
| --- | --- | --- | --- | --- | --- |
| mdeberta-v3-base-kor-further (same as mdeberta-v3-base) | 250 | 86 | 768 | 12 | 250K new SPM vocab |
## Further Pretraining Details (MLM Task)
- `mDeBERTa-v3-base-kor-further` was further pre-trained from `microsoft/mDeBERTa-v3-base` on roughly 40GB of Korean data using the MLM task.
| | Max length | Learning Rate | Batch Size | Train Steps | Warm-up Steps |
| --- | --- | --- | --- | --- | --- |
| mdeberta-v3-base-kor-further | 512 | 2e-5 | 8 | 5M | 50k |
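The sketch below shows how an equivalent MLM further-pre-training run could be configured with the Hugging Face `Trainer`, plugging in the hyperparameters from the table above; the tiny placeholder corpus, the 15% masking rate, and the overall setup are assumptions for illustration, not the authors' actual training script.
```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/mdeberta-v3-base")

# Placeholder corpus; in practice this would be the ~40GB Korean dataset
texts = ["예시 문장입니다.", "또 다른 예시 문장입니다."]
encodings = tokenizer(texts, truncation=True, max_length=512)  # max length 512
train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# Standard 15% masking for the MLM objective (an assumption, not stated in the card)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mdeberta-v3-base-kor-further",
    per_device_train_batch_size=8,   # batch size 8
    learning_rate=2e-5,              # learning rate 2e-5
    max_steps=5_000_000,             # 5M train steps
    warmup_steps=50_000,             # 50k warm-up steps
    remove_unused_columns=False,     # keep raw input_ids for the collator
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
```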
## Datasets
- Roughly 40GB of Korean data, including the Modu Corpus (newspaper, spoken, and written text), Korean Wikipedia, and National Petitions, was used for the further pre-training.
- Train: 10M lines, 5B tokens
- Valid: 2M lines, 1B tokens
- cf.) The original mDeBERTa-v3, like XLM-R, was trained on the [CC-100 dataset](https://data.statmt.org/cc-100/), and the Korean portion of that dataset is 54GB.
## Fine-tuning on NLU Tasks - Base Model
| Model | Size | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuAD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XLM-Roberta-Base | 1.03G | 89.03 | 86.65 | 82.80 | 80.23 | 78.45 | 93.80 | 64.70 / 88.94 | 64.06 |
| mdeberta-base | 534M | 90.01 | 87.43 | 85.55 | 80.41 | **82.65** | 94.06 | 65.48 / 89.74 | 62.91 |
| mdeberta-base-kor-further (Ours) | 534M | **90.52** | **87.87** | **85.85** | **80.65** | 81.90 | **94.98** | **66.07 / 90.35** | **68.16** |
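As an illustrative, hedged sketch (not the evaluation code behind the table above), fine-tuning for a sentence-classification task such as NSMC can start from the further-pre-trained checkpoint with a standard sequence-classification head:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "lighthouse/mdeberta-v3-base-kor-further",
    num_labels=2,  # e.g. binary sentiment labels for NSMC
)
tokenizer = AutoTokenizer.from_pretrained("lighthouse/mdeberta-v3-base-kor-further")

# Illustrative input; the actual fine-tuning loop and labels are omitted
batch = tokenizer("영화가 정말 재미있었어요.", return_tensors="pt")
logits = model(**batch).logits
print(logits.shape)  # torch.Size([1, 2])
```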
## KPMG Lighthouse KR
https://kpmgkr.notion.site/
## Citation
```
@misc{he2021debertav3,
title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
year={2021},
eprint={2111.09543},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
```
@inproceedings{
he2021deberta,
title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}
```
## Reference
- [mDeBERTa-v3-base-kor-further](https://github.com/kpmg-kr/mDeBERTa-v3-base-kor-further)
- [DeBERTa](https://github.com/microsoft/DeBERTa)
- [Huggingface Transformers](https://github.com/huggingface/transformers)
- [Modu Corpus (모두의 말뭉치)](https://corpus.korean.go.kr/)
- [Korpora: Korean Corpora Archives](https://github.com/ko-nlp/Korpora)
- [sooftware/Korean PLM](https://github.com/sooftware/Korean-PLM)