---
license: apache-2.0
language:
- ko
---

# Korean ALBERT

# Dataset
- [AI-HUB](https://www.aihub.or.kr/)
- [국립국어원 - 모두의 말뭉치 (National Institute of Korean Language, Modu Corpus)](https://kli.korean.go.kr/corpus/main/requestMain.do?lang=ko)
- [Korean News Comments](https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments)


# Evaluation results
- The fine-tuning code is available at [KcBERT-Finetune](https://github.com/Beomi/KcBERT-finetune).
   
|                        |    Size    | Average Score | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(spearman) | **Question Pair**<br/>(acc) | **KorQuAD (Dev)**<br/>(EM/F1) |
|:---------------------- |:----------:|:-------------:|:------------------:|:----------------------:|:------------------:|:--------------------:|:-------------------------:|:---------------------------:|:-----------------------------:|
| KcELECTRA-base         |    475M    |     84.84     |       91.71        |         86.90          |       74.80        |        81.65         |           82.65           |          **95.78**          |         70.60 / 90.11         |
| KcELECTRA-base-v2022   |    475M    |     85.20     |     **91.97**      |       **87.35**        |       76.50        |      **82.12**       |         **83.67**         |            95.12            |         69.00 / 90.40         |
| KcBERT-Base            |    417M    |     79.65     |       89.62        |         84.34          |       66.95        |        74.85         |           75.57           |            93.93            |         60.25 / 84.39         |
| KcBERT-Large           |    1.2G    |     81.33     |       90.68        |         85.53          |       70.15        |        76.99         |           77.49           |            94.06            |         62.16 / 86.64         |
| KoBERT                 |    351M    |     82.21     |       89.63        |         86.11          |       80.65        |        79.00         |           79.64           |            93.93            |         52.81 / 80.27         |
| XLM-Roberta-Base       |   1.03G    |     84.01     |       89.49        |         86.26          |       82.95        |        79.92         |           79.09           |            93.53            |         64.70 / 88.94         |
| HanBERT                |    614M    |     86.24     |       90.16        |         87.31          |       82.40        |        80.89         |           83.33           |            94.19            |         78.74 / 92.02         |
| KoELECTRA-Base         |    423M    |     84.66     |       90.21        |         86.87          |       81.90        |        80.85         |           83.21           |            94.20            |         61.10 / 89.59         |
| KoELECTRA-Base-v2      |    423M    |   **86.96**   |       89.70        |         87.02          |     **83.90**      |        80.61         |           84.30           |            94.72            |       **84.34 / 92.58**       |
| DistilKoBERT           |    108M    |     76.76     |       88.41        |         84.13          |       62.55        |        70.55         |           73.21           |            92.48            |         54.12 / 77.80         |
| **ko-albert-base-v1**  |  **51M**   |     80.46     |       86.83        |         82.26          |       69.95        |        74.17         |           74.48           |            94.06            |         76.08 / 86.82         |
| **ko-albert-large-v1** |  **75M**   |     82.39     |       86.91        |         83.12          |       76.10        |        76.01         |           77.46           |            94.33            |         77.64 / 87.99         |

*The size of HanBERT is the sum of the BERT model and the tokenizer DB.

*These results were obtained using the default configuration settings. Better performance may be achieved with additional hyperparameter tuning.


# How to use

```python
from transformers import AutoModel, AutoTokenizer

# Pick one checkpoint:
#   "lots-o/ko-albert-base-v1"   base model (51M)
#   "lots-o/ko-albert-large-v1"  large model (75M)
model_name = "lots-o/ko-albert-base-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```
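
Once a checkpoint is loaded, a common next step is to turn sentences into fixed-size vectors. The sketch below shows one way to do that, assuming mean pooling over non-padding tokens. The pooling choice is an assumption for illustration, not a method recommended by this model card. The helper names `mean_pool` and `embed` are hypothetical, and NumPy is an extra dependency introduced only for this example.

```python
import numpy as np


def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden) float array
    attention_mask:    (batch, seq_len) array of 0/1
    """
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                     # avoid division by zero
    return summed / counts


def embed(sentences, model_name="lots-o/ko-albert-base-v1"):
    """Hypothetical helper: encode sentences and mean-pool the encoder output."""
    # Imported lazily so that mean_pool stays usable without torch/transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**batch)

    return mean_pool(
        output.last_hidden_state.numpy(),
        batch["attention_mask"].numpy(),
    )
```

Calling `embed(["안녕하세요", "반갑습니다"])` would download the checkpoint on first use and return one vector per input sentence. For the benchmark tasks in the table above, task-specific fine-tuning is the route the referenced fine-tuning code takes; pooled embeddings from the raw encoder are only a starting point.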

# Acknowledgement
- The GCP/TPU environment used to train the ALBERT models was provided by the [TRC](https://sites.research.google/trc/about/) (TPU Research Cloud) program.

# Reference
## Paper
- [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)

## GitHub Repos
- [google-albert](https://github.com/google-research/albert)
- [albert-zh](https://github.com/brightmart/albert_zh)
- [KcBERT](https://github.com/Beomi/KcBERT)
- [KcBERT-Finetune](https://github.com/Beomi/KcBERT-finetune)