---
license: apache-2.0
language:
- ko
pipeline_tag: text-classification
---
# formal_classifier
formal classifier or honorific classifier
## Korean honorific (존댓말) vs. informal (반말) classifier
Some time ago I described a simple way to classify honorific (존댓말) and informal (반말) speech with a Korean morphological analyzer.<br>
When I tried to apply that method in practice, however, it produced errors in many cases.
For example:
```text
저번에 교수님께서 자료 가져오라했는데 기억나?
```
λΌλŠ” 문ꡬλ₯Ό "κ»˜μ„œ"λΌλŠ” μ‘΄μΉ­λ•Œλ¬Έμ— 전체문μž₯을 μ‘΄λŒ“λ§λ‘œ νŒλ‹¨ν•˜λŠ” 였λ₯˜κ°€ 많이 λ°œμƒν–ˆλ‹€. <br>
κ·Έλž˜μ„œ μ΄λ²ˆμ— λ”₯λŸ¬λ‹ λͺ¨λΈμ„ λ§Œλ“€κ³  κ·Έ 과정을 κ³΅μœ ν•΄λ³΄κ³ μžν•œλ‹€.
#### λΉ λ₯΄κ²Œ κ°€μ Έλ‹€ μ“°μ‹€ 뢄듀은 μ•„λž˜ μ½”λ“œλ‘œ λ°”λ‘œ μ‚¬μš©ν•˜μ‹€ 수 μžˆμŠ΅λ‹ˆλ‹€.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
tokenizer = AutoTokenizer.from_pretrained("j5ng/kcbert-formal-classifier")

formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
```
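The pipeline returns generic label names (`LABEL_0`, `LABEL_1`). Going by the example table in the data section, label 0 appears to mean informal speech (반말) and label 1 honorific speech (존댓말). A minimal sketch of a helper that makes the output readable, assuming that mapping holds; `LABEL_NAMES` and `to_readable` are hypothetical names, not part of the model or of transformers:

```python
# Assumed mapping, inferred from the example table (0 = informal, 1 = honorific).
LABEL_NAMES = {"LABEL_0": "informal", "LABEL_1": "honorific"}

def to_readable(predictions):
    """Convert pipeline output dicts into (name, score) tuples."""
    return [(LABEL_NAMES.get(p["label"], p["label"]), round(p["score"], 4))
            for p in predictions]

# Shaped like the pipeline output shown above.
sample = [{"label": "LABEL_0", "score": 0.9999139308929443}]
print(to_readable(sample))
# [('informal', 0.9999)]
```

Unknown label names pass through unchanged, so the helper does not hide a mismatch if the mapping assumption turns out to be wrong.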
***
### Dataset sources
#### Smilegate speech-style dataset (korean SmileStyle Dataset)
: https://github.com/smilegate-ai/korean_smile_style_dataset
#### AI Hub emotional dialogue corpus
: https://www.aihub.or.kr/
#### Dataset download (the AI Hub corpus can only be downloaded manually)
```bash
wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
```
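Once downloaded, the TSV has to be turned into (sentence, label) pairs. A rough sketch using only the standard library, assuming the file has `formal` and `informal` columns holding parallel sentences (check the header of your copy) and that some cells may be empty. `to_labeled_rows` is a hypothetical helper and not necessarily what `get_train_data.py` does:

```python
import csv
import io

def to_labeled_rows(tsv_text, formal_col="formal", informal_col="informal"):
    """Return (sentence, label) pairs: 1 for honorific, 0 for informal."""
    rows = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    for record in reader:
        for column, label in ((formal_col, 1), (informal_col, 0)):
            sentence = (record.get(column) or "").strip()
            if sentence:  # skip empty cells
                rows.append((sentence, label))
    return rows

# Tiny in-memory stand-in for the real file (column names are assumptions).
demo = "formal\tinformal\n기억나세요?\t기억나?\n\t뭐해?\n"
print(to_labeled_rows(demo))
```

For the real file, read it with `open("smilestyle_dataset.tsv", encoding="utf-8")` and pass the contents to `to_labeled_rows`.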
### Development environment
```text
Python 3.9
```
```bash
torch==1.13.1
transformers==4.26.0
pandas==1.5.3
emoji==2.2.0
soynlp==0.0.493
datasets==2.10.1
```
#### Base model
beomi/kcbert-base
- GitHub : https://github.com/Beomi/KcBERT
- HuggingFace : https://huggingface.co/beomi/kcbert-base
***
## Data
```bash
get_train_data.py
```
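### Examples
|sentence|English gloss|label|
|------|------|---|
|공부를 열심히 해도 열심히 한 만큼 성적이 잘 나오지 않아|Even when I study hard, my grades don't reflect the effort|0|
|아들에게 보내는 문자를 통해 관계가 회복되길 바랄게요|I hope the text message you send your son helps mend the relationship|1|
|참 열심히 사신 보람이 있으시네요|All your hard work in life has really paid off|1|
|나도 스시 좋아함 이번 달부터 영국 갈 듯|I like sushi too; looks like I'm going to the UK starting this month|0|
|본부장님이 내가 할 수 없는 업무를 계속 주셔서 힘들어|It's hard because the division head keeps giving me work I can't do|0|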
### Label distribution
|label|train|test|
|------|---|---|
|0|133,430|34,908|
|1|112,828|29,839|
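
The counts above work out to a split of roughly 79% train and 21% test, with label 0 making up about 54% of each split. A short sketch of that arithmetic, using only the numbers from the table (`counts` and `share` are illustrative names):

```python
# Counts copied from the table above.
counts = {"train": {0: 133_430, 1: 112_828}, "test": {0: 34_908, 1: 29_839}}

def share(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal."""
    return round(100 * part / whole, 1)

total = sum(sum(split.values()) for split in counts.values())
for name, split in counts.items():
    size = sum(split.values())
    print(name, size, f"{share(size, total)}% of all rows,",
          f"{share(split[0], size)}% label 0")
```

Since the two classes are close to balanced, plain accuracy is a reasonable first metric for this task.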
***
## Training
```bash
python3 modeling/train.py
```
***
## Inference
```bash
python3 inference.py
```
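```python
# Excerpt: methods of the classifier class.
def formal_percentage(self, text):
    # Probability of label 1 (honorific), rounded to two decimals.
    return round(float(self.predict(text)[0][1]), 2)

def print_message(self, text):
    result = self.formal_percentage(text)
    if result > 0.5:
        # "존댓말입니다" = "this is honorific speech"; "확률" = "probability"
        print(f'{text} : 존댓말입니다. ( 확률 {result*100}% )')
    else:
        # "반말입니다" = "this is informal speech"
        print(f'{text} : 반말입니다. ( 확률 {((1 - result)*100)}% )')
```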
Result
```
저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : 존댓말입니다. ( 확률 99.19% )
저번에 교수님께서 자료 가져오라했는데 기억나? : 반말입니다. ( 확률 92.86% )
```
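`formal_percentage` reads index 1 of the prediction as the honorific probability. With a two-class classification head, such probabilities typically come from a softmax over the model's two logits. A small sketch of that step in plain Python, assuming `predict` returns softmax probabilities (the source does not show its implementation); `softmax` and the logit values here are illustrative only:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for [informal, honorific]; not real model output.
probs = softmax([-1.2, 2.3])
print(round(probs[1], 2))
```

Because the two probabilities sum to 1, the informal probability is simply `1 - probs[1]`, which matches how `print_message` reports the informal case.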
***
## Citation
```bibtex
@misc{SmilegateAI2022KoreanSmileStyleDataset,
title = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
author = {Seonghyun Kim},
year = {2022},
howpublished = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
}
```
```bibtex
@inproceedings{lee2020kcbert,
title={KcBERT: Korean Comments BERT},
author={Lee, Junbum},
booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
pages={437--440},
year={2020}
}
```