---
license: apache-2.0
language:
- ko
pipeline_tag: text-classification
---

# formal_classifier
A formal / honorific speech classifier for Korean.

## Korean Formal/Informal Speech Classifier

μ˜€λž˜μ „μ— μ‘΄λŒ“λ§ , λ°˜λ§μ„ ν•œκ΅­μ–΄ ν˜•νƒœμ†Œ λΆ„μ„κΈ°λ‘œ λΆ„λ₯˜ν•˜λŠ” κ°„λ‹¨ν•œ 방법을 μ†Œκ°œν–ˆλ‹€.<br>
ν•˜μ§€λ§Œ 이 방법을 μ‹€μ œλ‘œ μ μš©ν•˜λ € ν–ˆλ”λ‹ˆ, λ§Žμ€ λΆ€λΆ„μ—μ„œ 였λ₯˜κ°€ λ°œμƒν•˜μ˜€λ‹€.

For example, the sentence
```bash
μ €λ²ˆμ— κ΅μˆ˜λ‹˜κ»˜μ„œ 자료 κ°€μ Έμ˜€λΌν–ˆλŠ”λ° κΈ°μ–΅λ‚˜?
```
was often judged formal as a whole just because of the honorific particle "κ»˜μ„œ", even though it is an informal sentence (see the toy sketch below). <br>
So this time I built a deep learning model instead, and I want to share the process.

#### If you just want to use the model right away, the code below works as-is.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
tokenizer = AutoTokenizer.from_pretrained("j5ng/kcbert-formal-classifier")

formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
print(formal_classifier("μ €λ²ˆμ— κ΅μˆ˜λ‹˜κ»˜μ„œ 자료 κ°€μ Έμ˜€λΌν–ˆλŠ”λ° κΈ°μ–΅λ‚˜?"))
# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
```
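
The pipeline returns raw label ids. Inferring from the example output above and the label column of the training data below (0 = informal, 1 = formal), `LABEL_0` corresponds to informal and `LABEL_1` to formal speech. A small helper of my own, continuing from the snippet above, makes that explicit:

```python
# Assumed mapping, inferred from the example output above and the training labels below.
ID2NAME = {"LABEL_0": "informal", "LABEL_1": "formal"}

def classify(text: str) -> dict:
    result = formal_classifier(text)[0]  # e.g. {'label': 'LABEL_0', 'score': 0.9999...}
    return {"text": text, "label": ID2NAME[result["label"]], "score": result["score"]}

print(classify("μ €λ²ˆμ— κ΅μˆ˜λ‹˜κ»˜μ„œ 자료 κ°€μ Έμ˜€λΌν–ˆλŠ”λ° κΈ°μ–΅λ‚˜?"))
# {'text': '...', 'label': 'informal', 'score': 0.9999...}
```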

***

### Dataset Sources

#### Smilegate speech-style dataset (korean SmileStyle Dataset)
 : https://github.com/smilegate-ai/korean_smile_style_dataset

#### AI Hub emotional dialogue corpus
 : https://www.aihub.or.kr/

#### Dataset download (the AI Hub corpus can only be downloaded manually from the site)
```bash
wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
```
 
### Development Environment
```bash
Python 3.9
```

```bash
torch==1.13.1
transformers==4.26.0
pandas==1.5.3
emoji==2.2.0
soynlp==0.0.493
datasets==2.10.1
```
 
 
 #### μ‚¬μš© λͺ¨λΈ 
 beomi/kcbert-base 
  - GitHub : https://github.com/Beomi/KcBERT
  - HuggingFace : https://huggingface.co/beomi/kcbert-base
***

## Data
```bash
get_train_data.py
```
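
`get_train_data.py` builds (sentence, label) pairs from the source corpora. Below is a rough sketch of the idea for the SmileStyle part only, assuming the TSV exposes `formal` and `informal` style columns; the actual script also merges the AI Hub corpus and may differ in detail:

```python
import pandas as pd

df = pd.read_csv("smilestyle_dataset.tsv", sep="\t")

# Assumed column names; adjust to the columns the TSV actually contains.
formal = df["formal"].dropna().to_frame(name="sentence").assign(label=1)
informal = df["informal"].dropna().to_frame(name="sentence").assign(label=0)

pairs = pd.concat([formal, informal], ignore_index=True)
pairs.to_csv("train.tsv", sep="\t", index=False)
```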

### Examples
|sentence|label|
|------|---|
|곡뢀λ₯Ό μ—΄μ‹¬νžˆ 해도 μ—΄μ‹¬νžˆ ν•œ 만큼 성적이 잘 λ‚˜μ˜€μ§€ μ•Šμ•„|0|
|μ•„λ“€μ—κ²Œ λ³΄λ‚΄λŠ” 문자λ₯Ό 톡해 관계가 회볡되길 λ°”λž„κ²Œμš”|1|
|μ°Έ μ—΄μ‹¬νžˆ 사신 보람이 μžˆμœΌμ‹œλ„€μš”|1|
|λ‚˜λ„ μŠ€μ‹œ 쒋아함 이번 달뢀터 영ꡭ 갈 λ“―|0|
|λ³ΈλΆ€μž₯λ‹˜μ΄ λ‚΄κ°€ ν•  수 μ—†λŠ” 업무λ₯Ό 계속 μ£Όμ…”μ„œ νž˜λ“€μ–΄|0|


### Label distribution
|label|train|test|
|------|---|---|
|0|133,430|34,908|
|1|112,828|29,839|

***

## Training
```bash
python3 modeling/train.py
```
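
The training script fine-tunes beomi/kcbert-base with a two-class head; the exact hyperparameters live in `modeling/train.py`. A hedged, self-contained sketch of the general shape (file names and settings here are illustrative assumptions, not the script's actual values):

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Hypothetical output of get_train_data.py with columns: sentence, label.
train_df = pd.read_csv("train.tsv", sep="\t")

tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")
model = AutoModelForSequenceClassification.from_pretrained("beomi/kcbert-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=5e-5)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```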

***

## Inference
```bash
python3 inference.py
```

```python
def formal_percentage(self, text):
    # Probability that the input is formal speech (class index 1).
    return round(float(self.predict(text)[0][1]), 2)

def print_message(self, text):
    result = self.formal_percentage(text)
    if result >= 0.5:
        # "<text> : it is formal speech. ( probability ...% )"
        print(f'{text} : μ‘΄λŒ“λ§μž…λ‹ˆλ‹€. ( ν™•λ₯  {result*100}% )')
    else:
        # "<text> : it is informal speech. ( probability ...% )"
        print(f'{text} : λ°˜λ§μž…λ‹ˆλ‹€. ( ν™•λ₯  {((1 - result)*100)}% )')
```
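
The excerpt above (from `inference.py`) relies on a `predict` helper that returns class probabilities. A minimal sketch of what such a helper could look like, with the two methods above living on the same class (the real script may differ):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class FormalClassifier:
    def __init__(self, model_name: str = "j5ng/kcbert-formal-classifier"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

    def predict(self, text: str) -> torch.Tensor:
        # Returns a (1, 2) tensor of class probabilities: [:, 0] = informal, [:, 1] = formal.
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        return torch.softmax(logits, dim=-1)
```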

Results:
```
μ €λ²ˆμ— κ΅μˆ˜λ‹˜κ»˜μ„œ 자료 κ°€μ Έμ˜€λΌν•˜μ…¨λŠ”λ° κΈ°μ–΅λ‚˜μ„Έμš”? : μ‘΄λŒ“λ§μž…λ‹ˆλ‹€. ( ν™•λ₯  99.19% )
μ €λ²ˆμ— κ΅μˆ˜λ‹˜κ»˜μ„œ 자료 κ°€μ Έμ˜€λΌν–ˆλŠ”λ° κΈ°μ–΅λ‚˜? : λ°˜λ§μž…λ‹ˆλ‹€. ( ν™•λ₯  92.86% )
```
The formal variant of the sentence is classified as formal speech with 99.19% probability, and the informal variant as informal speech with 92.86%.



***

## Citation
```bibtex
@misc{SmilegateAI2022KoreanSmileStyleDataset,
  title         = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
  author        = {Seonghyun Kim},
  year          = {2022},
  howpublished  = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
}
```

```bibtex
@inproceedings{lee2020kcbert,
  title={KcBERT: Korean Comments BERT},
  author={Lee, Junbum},
  booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
  pages={437--440},
  year={2020}
}
```