---
license: apache-2.0
language:
- ko
pipeline_tag: text-classification
---
|
|
|
# formal_classifier

A formal (honorific) vs. informal speech classifier for Korean.
|
|
|
## Korean Formal/Informal (존댓말/반말) Classifier
|
|
|
A simple way to classify Korean sentences as formal (존댓말) or informal (반말) is to use a morphological analyzer.<br>
In practice, however, that approach produced errors in many cases.

For example, the sentence:

```
저번에 교수님께서 자료 가져오라했는데 기억나?
```

was frequently misjudged as formal as a whole, just because it contains the honorific marker "께서".<br>
So this time I built a deep learning model instead, and I want to share the process.
|
|
|
#### Quick start: you can use the model right away with the code below.
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
tokenizer = AutoTokenizer.from_pretrained("j5ng/kcbert-formal-classifier")

formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
```
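The pipeline returns raw `LABEL_0`/`LABEL_1` names. Judging from the example output above (the informal sentence scores `LABEL_0`), `LABEL_0` corresponds to informal (반말) and `LABEL_1` to formal (존댓말). A small helper to make that readable; note the mapping is inferred from the example output, not from an official `id2label` config:

```python
# Assumed mapping, inferred from the example output above:
# LABEL_0 = informal (반말), LABEL_1 = formal (존댓말).
ID2LABEL = {"LABEL_0": "informal (반말)", "LABEL_1": "formal (존댓말)"}

def readable(pipeline_output):
    """Turn [{'label': 'LABEL_0', 'score': ...}] into a (class, score) pair."""
    top = pipeline_output[0]
    return ID2LABEL[top["label"]], round(top["score"], 4)

# Using the example output shown above:
print(readable([{"label": "LABEL_0", "score": 0.9999139308929443}]))
# ('informal (반말)', 0.9999)
```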
|
|
|
***
|
|
|
### Dataset sources
|
|
|
#### Smilegate speech-style dataset (Korean SmileStyle Dataset)

: https://github.com/smilegate-ai/korean_smile_style_dataset
|
|
|
#### AI Hub emotional dialogue corpus (감성 대화 말뭉치)

: https://www.aihub.or.kr/
|
|
|
#### Dataset download (the AI Hub corpus can only be downloaded manually)
|
```bash
wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
```
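The SmileStyle corpus is a parallel, style-variant corpus, so the TSV stores one speech style per column rather than ready-made (sentence, label) rows. A hypothetical sketch of turning it into labeled pairs, assuming it contains `formal` and `informal` columns (the column names are an assumption for illustration; check the actual file header):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for smilestyle_dataset.tsv; for the real file use:
#   df = pd.read_csv("smilestyle_dataset.tsv", sep="\t")
# The 'formal'/'informal' column names are assumptions for illustration.
sample_tsv = "formal\tinformal\n안녕하세요\t안녕\n감사합니다\t고마워\n"
df = pd.read_csv(StringIO(sample_tsv), sep="\t")

# Build (sentence, label) pairs: formal -> 1, informal -> 0.
pairs = [(s, 1) for s in df["formal"].dropna()] + \
        [(s, 0) for s in df["informal"].dropna()]
print(pairs)
# [('안녕하세요', 1), ('감사합니다', 1), ('안녕', 0), ('고마워', 0)]
```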
|
|
|
### Development environment

```bash
Python 3.9
```

```bash
torch==1.13.1
transformers==4.26.0
pandas==1.5.3
emoji==2.2.0
soynlp==0.0.493
datasets==2.10.1
```
|
|
|
|
|
#### Base model

beomi/kcbert-base

- GitHub : https://github.com/Beomi/KcBERT
- HuggingFace : https://huggingface.co/beomi/kcbert-base
|
***
|
|
|
## Data

```bash
get_train_data.py
```
|
|
|
### Examples

Labels: 1 = formal (존댓말), 0 = informal (반말).

|sentence|label|
|------|---|
|공부를 열심히 해도 열심히 한 만큼 성적이 잘 나오지 않아|0|
|아들에게 보내는 문자를 통해 관계가 회복되길 바랄게요|1|
|참 열심히 사시는 보람이 있으시네요|1|
|나도 스시 좋아함 이번 달부터 영국 갈 듯|0|
|본부장님이 내가 할 수 없는 업무를 계속 주셔서 힘들어|0|
|
|
|
|
|
### Distribution

|label|train|test|
|------|---|---|
|0|133,430|34,908|
|1|112,828|29,839|
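A quick sanity check on the counts above: the train/test split is roughly 80/20, and the two classes are close to balanced.

```python
# Counts taken from the distribution table above.
train = {0: 133_430, 1: 112_828}
test = {0: 34_908, 1: 29_839}

total_train = sum(train.values())   # 246258
total_test = sum(test.values())     # 64747

# Fraction of the whole dataset held out for testing.
print(round(total_test / (total_train + total_test), 3))  # 0.208

# Share of label 0 (informal) in the training split.
print(round(train[0] / total_train, 3))  # 0.542
```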
|
|
|
***
|
|
|
## Training (train)

```bash
python3 modeling/train.py
```
|
|
|
***
|
|
|
## Prediction (inference)

```bash
python3 inference.py
```
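The methods below belong to a classifier class whose `predict` is not shown here. A minimal sketch of what such a `predict` could look like (hypothetical, not the repository's actual implementation): it returns a `(1, 2)` tensor of softmax probabilities, so `predict(text)[0][1]` is the formal-class probability used below.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class FormalClassifier:
    """Hypothetical wrapper; only sketches the predict() the methods below rely on."""

    def __init__(self, model_name="j5ng/kcbert-formal-classifier"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

    def predict(self, text):
        # Returns a (1, 2) tensor: [[P(informal), P(formal)]].
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        return torch.softmax(logits, dim=-1)
```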
|
|
|
```python
def formal_percentage(self, text):
    # Probability that the input sentence is formal (존댓말).
    return round(float(self.predict(text)[0][1]), 2)

def print_message(self, text):
    result = self.formal_percentage(text)
    if result > 0.5:
        print(f'{text} : 존댓말입니다. ( 확률 {result*100}% )')
    else:
        print(f'{text} : 반말입니다. ( 확률 {(1 - result)*100}% )')
```
|
|
|
Results:

```
저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : 존댓말입니다. ( 확률 99.19% )
저번에 교수님께서 자료 가져오라했는데 기억나? : 반말입니다. ( 확률 92.86% )
```
|
|
|
|
|
|
|
***
|
|
|
## Citation

```bibtex
@misc{SmilegateAI2022KoreanSmileStyleDataset,
  title = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
  author = {Seonghyun Kim},
  year = {2022},
  howpublished = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
}
```

```bibtex
@inproceedings{lee2020kcbert,
  title={KcBERT: Korean Comments BERT},
  author={Lee, Junbum},
  booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
  pages={437--440},
  year={2020}
}
```
|