---
language: 
- multilingual
- en 
- ko
- ar 
- bg 
- de 
- el 
- es 
- fr 
- hi 
- ru 
- sw 
- th 
- tr 
- ur 
- vi 
- zh
tags: 
  - deberta
  - deberta-v3
  - mdeberta
  - korean
  - pretraining
license: mit
---

# mDeBERTa-v3-base-kor-further

> 💡 The project below was carried out by KPMG Lighthouse Korea.  
> At KPMG Lighthouse Korea, we build NLP/Vision AI models with edge technology to solve a variety of problems in the financial area.  
> https://kpmgkr.notion.site/

## What is DeBERTa?
- [DeBERTa](https://arxiv.org/abs/2006.03654) applies `Disentangled Attention` + `Enhanced Mask Decoder` to learn the positional information of words effectively. Unlike the absolute position embeddings used in BERT and RoBERTa, DeBERTa encodes the relative positions between words as learnable vectors that are trained together with the model (a simplified sketch of this decomposition follows this list). As a result, it showed stronger performance than BERT and RoBERTa.
- [DeBERTa-v3](https://arxiv.org/abs/2111.09543) replaces the MLM (Masked Language Model) objective used in the previous version with the ELECTRA-style RTD (Replaced Token Detection) task and applies Gradient-Disentangled Embedding Sharing, improving the efficiency of pre-training.
- To train the DeBERTa architecture on rich Korean data, `mDeBERTa-v3-base-kor-further` takes the `mDeBERTa-v3-base` model released by Microsoft and **further pre-trains** it on about 40GB of Korean data.
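
The disentangled attention mentioned above decomposes the attention score between positions *i* and *j* into content-to-content, content-to-position, and position-to-content terms, indexed by the bucketed relative distance δ(i, j). The following NumPy sketch illustrates that decomposition; the function and argument names are ours, and it is a simplified illustration of the formula in the paper, not DeBERTa's actual implementation.

```python
import numpy as np

def disentangled_scores(q_c, k_c, q_r, k_r, rel_idx):
    """Illustrative DeBERTa-style disentangled attention scores.

    q_c, k_c : (seq_len, d)       content projections of queries / keys
    q_r, k_r : (num_buckets, d)   projections of the relative-position embeddings
    rel_idx  : (seq_len, seq_len) integer buckets of the relative distance delta(i, j)
    """
    # content-to-content: the usual dot-product attention term
    c2c = q_c @ k_c.T
    # content-to-position: query content attends to the key's relative position
    c2p = np.take_along_axis(q_c @ k_r.T, rel_idx, axis=1)
    # position-to-content: key content attends to the query's relative position
    p2c = np.take_along_axis(k_c @ q_r.T, rel_idx, axis=1).T
    d = q_c.shape[-1]
    # the paper scales by 1/sqrt(3d) because three terms are summed
    return (c2c + c2p + p2c) / np.sqrt(3 * d)
```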
  
## How to Use
- Requirements
    ```
    pip install transformers
    pip install sentencepiece
    ```   
- Huggingface Hub
    ```python
    from transformers import AutoModel, AutoTokenizer
    
    model = AutoModel.from_pretrained("lighthouse/mdeberta-v3-base-kor-further")  # DebertaV2Model
    tokenizer = AutoTokenizer.from_pretrained("lighthouse/mdeberta-v3-base-kor-further")  # DebertaV2Tokenizer (SentencePiece)
    ```
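
Once loaded, the model can be sanity-checked by encoding a sentence and inspecting the hidden states. The example sentence and variable names below are ours; this is only an illustrative sketch.

```python
import torch

# Tokenize a Korean sentence and run a forward pass with the model loaded above
inputs = tokenizer("한국어 문장을 입력해 보세요.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings: (batch, seq_len, hidden); hidden size is 768 for the base model
print(outputs.last_hidden_state.shape)
```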

## Pre-trained Models
- The model architecture is identical to that of `mdeberta-v3-base` released by Microsoft.
    
    | Model | Vocabulary(K) | Backbone Parameters(M) | Hidden Size | Layers | Note |
    | --- | --- | --- | --- | --- | --- |
    | mdeberta-v3-base-kor-further (identical to mdeberta-v3-base) | 250 | 86 | 768 | 12 | 250K new SPM vocab |

## Further Pretraining Details (MLM Task)
- `mDeBERTa-v3-base-kor-further` was further pre-trained from `microsoft/mDeBERTa-v3-base` on about 40GB of Korean data using the MLM task.
    
    | Model | Max length | Learning Rate | Batch Size | Train Steps | Warm-up Steps |
    | --- | --- | --- | --- | --- | --- |
    | mdeberta-v3-base-kor-further | 512 | 2e-5 | 8 | 5M | 50k |
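
For reference, a further pre-training run with the hyperparameters above could be set up roughly as follows with the Hugging Face `Trainer`. This is a minimal sketch under our own assumptions (the corpus file names, line-by-line text loading, and the 15% masking probability are ours); it is not the training script that was actually used.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/mdeberta-v3-base")

# Hypothetical line-by-line Korean corpus files
raw = load_dataset("text", data_files={"train": "korean_train.txt",
                                       "validation": "korean_valid.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)  # max length 512

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM collator; the 15% masking ratio is an assumption, not reported in this card
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mdeberta-v3-base-kor-further",
    per_device_train_batch_size=8,   # batch size 8, as in the table above
    learning_rate=2e-5,              # learning rate, as in the table above
    max_steps=5_000_000,             # 5M train steps
    warmup_steps=50_000,             # 50k warm-up steps
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=tokenized["train"], eval_dataset=tokenized["validation"])
trainer.train()
```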
    

## Datasets
- About 40GB of Korean data, including the Modu Corpus (newspaper, spoken, and written text), Korean Wikipedia, and National Petition texts, was used for the further pre-training.
    - Train: 10M lines, 5B tokens
    - Valid: 2M lines, 1B tokens
    - cf) The original mDeBERTa-v3 was trained, like XLM-R, on the [CC-100 dataset](https://data.statmt.org/cc-100/), of which the Korean portion is 54GB.
    

## Fine-tuning on NLU Tasks - Base Model
| Model | Size | NSMC(acc) | Naver NER(F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuAD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XLM-Roberta-Base | 1.03G | 89.03 | 86.65 | 82.80 | 80.23 | 78.45 | 93.80 | 64.70 / 88.94 | 64.06 |
| mdeberta-base | 534M | 90.01 | 87.43 | 85.55 | 80.41 | **82.65** | 94.06 | 65.48 / 89.74 | 62.91 |
| mdeberta-base-kor-further (Ours) | 534M | **90.52** | **87.87** | **85.85** | **80.65** | 81.90 | **94.98** | **66.07 / 90.35** | **68.16** |
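
As an illustration of how such a fine-tune can be run, the sketch below trains a sequence classifier on a Korean sentiment dataset such as NSMC. The dataset identifier, column names, and hyperparameters are our assumptions and do not correspond to the exact evaluation setup behind the table above.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "lighthouse/mdeberta-v3-base-kor-further"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Assumes an NSMC-style dataset with "document" text and binary "label" columns
nsmc = load_dataset("nsmc")

def tokenize(batch):
    return tokenizer(batch["document"], truncation=True, max_length=128)

encoded = nsmc.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="nsmc-finetune",
    per_device_train_batch_size=32,  # illustrative hyperparameters
    learning_rate=3e-5,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"], eval_dataset=encoded["test"])
trainer.train()
```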


## KPMG Lighthouse KR
https://kpmgkr.notion.site/ 


## Citation
```
@misc{he2021debertav3,
      title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing}, 
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

```
@inproceedings{
he2021deberta,
title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}
```

## Reference
- [mDeBERTa-v3-base-kor-further](https://github.com/kpmg-kr/mDeBERTa-v3-base-kor-further)
- [DeBERTa](https://github.com/microsoft/DeBERTa)
- [Huggingface Transformers](https://github.com/huggingface/transformers)
- [Modu Corpus (모두의 말뭉치)](https://corpus.korean.go.kr/)
- [Korpora: Korean Corpora Archives](https://github.com/ko-nlp/Korpora)
- [sooftware/Korean PLM](https://github.com/sooftware/Korean-PLM)