license: apache-2.0
language:
  - ko
pipeline_tag: text-classification

formal_classifier

formal classifier or honorific classifier

Korean formal (jondaetmal) vs. informal (banmal) speech classifier

Some time ago I introduced a simple way to classify jondaetmal and banmal using a Korean morphological analyzer.
When I tried to apply that method in practice, however, it produced errors in many cases.

For example:

저번에 교수님께서 자료 가져오라했는데 기억나? ("The professor told us to bring the materials last time, remember?")

Because of the honorific particle "께서", the whole sentence was frequently misjudged as jondaetmal, even though it ends in an informal form.
So this time I built a deep learning model, and I share the process here.

If you just want to use the model right away, the code below works as-is.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
tokenizer = AutoTokenizer.from_pretrained('j5ng/kcbert-formal-classifier')

formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
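
The pipeline returns generic label names. Below is a minimal sketch of turning them into readable names. It assumes `LABEL_0` is banmal and `LABEL_1` is jondaetmal, which is consistent with the 0/1 labels in the training data examples in this README. `LABEL_NAMES` and `readable` are hypothetical helpers, not part of the model repository.

```python
# Hypothetical mapping; assumes LABEL_0 = banmal (informal), LABEL_1 = jondaetmal (formal).
LABEL_NAMES = {"LABEL_0": "banmal", "LABEL_1": "jondaetmal"}


def readable(predictions):
    """Convert text-classification pipeline output into (name, score) pairs."""
    return [
        (LABEL_NAMES.get(p["label"], p["label"]), round(p["score"], 4))
        for p in predictions
    ]


# Output copied from the example above, so no model download is needed here.
sample = [{"label": "LABEL_0", "score": 0.9999139308929443}]
print(readable(sample))
```

With the real pipeline, the same helper could be called as `readable(formal_classifier(text))`.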

๋ฐ์ดํ„ฐ ์…‹ ์ถœ์ฒ˜

์Šค๋งˆ์ผ๊ฒŒ์ดํŠธ ๋งํˆฌ ๋ฐ์ดํ„ฐ ์…‹(korean SmileStyle Dataset)

: https://github.com/smilegate-ai/korean_smile_style_dataset

AI ํ—ˆ๋ธŒ ๊ฐ์„ฑ ๋Œ€ํ™” ๋ง๋ญ‰์น˜

: https://www.aihub.or.kr/

๋ฐ์ดํ„ฐ์…‹ ๋‹ค์šด๋กœ๋“œ(AIํ—ˆ๋ธŒ๋Š” ์ง์ ‘๋‹ค์šด๋กœ๋“œ๋งŒ ๊ฐ€๋Šฅ)

wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
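
The logic of `get_train_data.py` is not shown in this README. As a rough sketch only, the downloaded TSV could be turned into (sentence, label) pairs as below. It assumes the file has columns named `formal` and `informal`; the column names, the `label_rows` helper, and the two sample rows are illustrative assumptions.

```python
import csv
import io


def label_rows(tsv_file, formal_col="formal", informal_col="informal"):
    """Yield (sentence, label) pairs: 1 for the formal column, 0 for the informal one."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        for col, label in ((formal_col, 1), (informal_col, 0)):
            sentence = (row.get(col) or "").strip()
            if sentence:  # skip empty cells
                yield sentence, label


# Tiny in-memory stand-in for smilestyle_dataset.tsv (made-up rows, assumed column names).
sample = io.StringIO("formal\tinformal\n안녕하세요\t안녕\n감사합니다\t\n")
pairs = list(label_rows(sample))
print(pairs)
```

For the real file, open it with `open("smilestyle_dataset.tsv", encoding="utf-8", newline="")` and pass the file object to `label_rows`.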

Development environment

Python 3.9
torch==1.13.1
transformers==4.26.0
pandas==1.5.3
emoji==2.2.0
soynlp==0.0.493
datasets==2.10.1

Model used

beomi/kcbert-base


๋ฐ์ดํ„ฐ

get_train_data.py

์˜ˆ์‹œ

sentence label
๊ณต๋ถ€๋ฅผ ์—ด์‹ฌํžˆ ํ•ด๋„ ์—ด์‹ฌํžˆ ํ•œ ๋งŒํผ ์„ฑ์ ์ด ์ž˜ ๋‚˜์˜ค์ง€ ์•Š์•„ 0
์•„๋“ค์—๊ฒŒ ๋ณด๋‚ด๋Š” ๋ฌธ์ž๋ฅผ ํ†ตํ•ด ๊ด€๊ณ„๊ฐ€ ํšŒ๋ณต๋˜๊ธธ ๋ฐ”๋ž„๊ฒŒ์š” 1
์ฐธ ์—ด์‹ฌํžˆ ์‚ฌ์‹  ๋ณด๋žŒ์ด ์žˆ์œผ์‹œ๋„ค์š” 1
๋‚˜๋„ ์Šค์‹œ ์ข‹์•„ํ•จ ์ด๋ฒˆ ๋‹ฌ๋ถ€ํ„ฐ ์˜๊ตญ ๊ฐˆ ๋“ฏ 0
๋ณธ๋ถ€์žฅ๋‹˜์ด ๋‚ด๊ฐ€ ํ•  ์ˆ˜ ์—†๋Š” ์—…๋ฌด๋ฅผ ๊ณ„์† ์ฃผ์…”์„œ ํž˜๋“ค์–ด 0

๋ถ„ํฌ

label train test
0 133,430 34,908
1 112,828 29,839
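
To see how balanced the two classes are, the counts above can be turned into shares. This is a small arithmetic sketch; `counts` and `share_of_label` are names introduced here for illustration.

```python
# Counts copied from the distribution table above.
counts = {
    "train": {0: 133_430, 1: 112_828},
    "test": {0: 34_908, 1: 29_839},
}


def share_of_label(split, label):
    """Fraction of a split's sentences that carry the given label."""
    total = sum(counts[split].values())
    return counts[split][label] / total


for split in counts:
    total = sum(counts[split].values())
    print(f"{split}: {total} sentences, label 0 share = {share_of_label(split, 0):.1%}")
```

In both splits label 0 makes up roughly 54% of the sentences, so the classes are only mildly imbalanced.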

Training (train)

python3 modeling/train.py

Inference

python3 inference.py
def formal_percentage(self, text):
    # Probability that the text is jondaetmal (class index 1), rounded to two decimals.
    return round(float(self.predict(text)[0][1]), 2)

def print_message(self, text):
    result = self.formal_percentage(text)
    if result > 0.5:
        # "존댓말입니다" = "This is jondaetmal"; "확률" = "probability"
        print(f'{text} : 존댓말입니다. ( 확률 {result*100}% )')
    else:
        # "반말입니다" = "This is banmal"
        print(f'{text} : 반말입니다. ( 확률 {((1 - result)*100)}% )')
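
The `predict` method itself is not shown in this README. Below is a minimal sketch of how raw model scores could become the probability used above. It assumes `predict` returns softmax probabilities over the two classes, with index 1 meaning jondaetmal. The logits are made-up numbers for illustration, and `softmax` and `formal_probability` are hypothetical helpers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def formal_probability(logits):
    """Probability of class 1 (assumed to be jondaetmal), rounded to two decimals."""
    return round(softmax(logits)[1], 2)


# Made-up logits: class 1 scores higher, so the sentence leans formal.
print(formal_probability([-2.0, 2.0]))
```

Subtracting the maximum before exponentiating keeps `math.exp` from overflowing on large logits without changing the result.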

Results

저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : 존댓말입니다. ( 확률 99.19% )
저번에 교수님께서 자료 가져오라했는데 기억나? : 반말입니다. ( 확률 92.86% )

The first sentence is classified as jondaetmal with probability 99.19%, the second as banmal with probability 92.86%.

Citation

@misc{SmilegateAI2022KoreanSmileStyleDataset,
  title         = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
  author        = {Seonghyun Kim},
  year          = {2022},
  howpublished  = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
}
@inproceedings{lee2020kcbert,
  title={KcBERT: Korean Comments BERT},
  author={Lee, Junbum},
  booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
  pages={437--440},
  year={2020}
}