j5ng committed
Commit e92823b • 1 Parent(s): eb361ab

Update README.md

Files changed (1)
  1. README.md +144 -0
README.md CHANGED
@@ -1,3 +1,147 @@
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - ko
5
+ pipeline_tag: text-classification
6
  ---
7
+
8
+ # formal_classifier
9
+ formal classifier or honorific classifier
10
+
11
+ ## Korean formal (jondaetmal) vs. informal (banmal) speech classifier
12
+
13
+ A while ago, I introduced a simple method for classifying formal (jondaetmal) and informal (banmal) Korean speech with a Korean morphological analyzer.<br>
14
+ However, when I tried to apply that method in practice, it produced errors in many cases.
15
+
16
+ For example:
17
+ ```bash
18
+ 저번에 교수님께서 자료 가져오라했는데 기억나?
19
+ ```
20
+ Because of the honorific particle "께서", the analyzer-based method often misjudged this whole sentence as formal, even though its ending is informal. <br>
21
+ So this time I built a deep learning model, and I want to share the process.
22
+
23
+ #### If you just want to use the model quickly, the code below works right away.
24
+ ```python
25
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
26
+
27
+ model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
28
+ tokenizer = AutoTokenizer.from_pretrained('j5ng/kcbert-formal-classifier')
29
+
30
+ formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
31
+ print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
32
+ # [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
33
+ ```
34
+
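The pipeline returns generic label names such as `LABEL_0`. A minimal sketch of turning that output into readable names is shown here, assuming the mapping 0 = informal and 1 = formal that the `label` column in the data examples of this README suggests. `LABEL_NAMES` and `readable` are hypothetical helpers, not part of the model or of `transformers`.

```python
# Assumed mapping, inferred from the README's label column
# (0 = informal, 1 = formal); it is not read from the model config.
LABEL_NAMES = {
    "LABEL_0": "informal (banmal)",
    "LABEL_1": "formal (jondaetmal)",
}


def readable(predictions):
    """Convert pipeline-style dicts into (name, score) tuples."""
    return [
        (LABEL_NAMES.get(p["label"], p["label"]), round(p["score"], 4))
        for p in predictions
    ]


# Same shape as the pipeline output shown in the snippet above.
sample = [{"label": "LABEL_0", "score": 0.9999139308929443}]
print(readable(sample))  # [('informal (banmal)', 0.9999)]
```

Unknown labels are passed through unchanged, so the helper stays harmless if the mapping assumption turns out to be wrong.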
35
+ ***
36
+
37
+ ### Dataset sources
38
+
39
+ #### Smilegate speech style dataset (Korean SmileStyle Dataset)
40
+ : https://github.com/smilegate-ai/korean_smile_style_dataset
41
+
42
+ #### AI Hub emotional dialogue corpus
43
+ : https://www.aihub.or.kr/
44
+
45
+ #### Dataset download (the AI Hub data can only be downloaded manually)
46
+ ```bash
47
+ wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
48
+ ```
49
+
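Once the TSV is downloaded, it has to be turned into `(sentence, label)` pairs. The sketch below shows one possible way to do that with the standard library only. It assumes the file has columns named `formal` and `informal`; those column names, the helper `labeled_pairs`, and the tiny in-memory sample are illustrative assumptions, and the repository's own `get_train_data.py` may work differently.

```python
import csv
import io


def labeled_pairs(tsv_text, formal_col="formal", informal_col="informal"):
    """Yield (sentence, label) pairs: 1 for the formal column, 0 for the informal one."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for col, label in ((formal_col, 1), (informal_col, 0)):
            sentence = (row.get(col) or "").strip()
            if sentence:  # skip cells that were left empty
                yield sentence, label


# Tiny stand-in for the downloaded file; the column names are assumptions.
sample = "formal\tinformal\n기억나세요?\t기억나?\n감사합니다\t\n"
print(list(labeled_pairs(sample)))
# [('기억나세요?', 1), ('기억나?', 0), ('감사합니다', 1)]
```

For the real file, the text could be read with `open("smilestyle_dataset.tsv", encoding="utf-8").read()` and passed to the same function.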
50
+ ### Development environment
51
+ ```bash
52
+ Python3.9
53
+ ```
54
+
55
+ ```bash
56
+ torch==1.13.1
57
+ transformers==4.26.0
58
+ pandas==1.5.3
59
+ emoji==2.2.0
60
+ soynlp==0.0.493
61
+ datasets==2.10.1
62
+ pandas==1.5.3
63
+ ```
64
+
65
+
66
+ #### Model used
67
+ beomi/kcbert-base
68
+ - GitHub : https://github.com/Beomi/KcBERT
69
+ - HuggingFace : https://huggingface.co/beomi/kcbert-base
70
+ ***
71
+
72
+ ## Data
73
+ ```bash
74
+ get_train_data.py
75
+ ```
76
+
77
+ ### Examples
78
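+ |sentence|English gloss|label (0 = informal, 1 = formal)|
79
+ |------|---|---|
80
+ |공부를 열심히 해도 열심히 한 만큼 성적이 잘 나오지 않아|Even when I study hard, my grades don't reflect the effort|0|
81
+ |아들에게 보내는 문자를 통해 관계가 회복되길 바랄게요|I hope the text message you send your son helps restore the relationship|1|
82
+ |참 열심히 사신 보람이 있으시네요|Your hard work in life has really paid off|1|
83
+ |나도 스시 좋아함 이번 달부터 영국 갈 듯|I like sushi too; looks like I'll be going to the UK starting this month|0|
84
+ |본부장님이 내가 할 수 없는 업무를 계속 주셔서 힘들어|The division head keeps giving me work I can't do, so I'm struggling|0|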
85
+
86
+
87
+ ### Distribution
88
+ |label|train|test|
89
+ |------|---|---|
90
+ |0|133,430|34,908|
91
+ |1|112,828|29,839|
92
+
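The class balance implied by the table can be checked with a few lines of arithmetic. The counts below are copied from the table; `summarize` is a hypothetical helper added only for illustration.

```python
# Counts copied from the distribution table (label 0 and label 1 per split).
counts = {
    "train": {0: 133_430, 1: 112_828},
    "test": {0: 34_908, 1: 29_839},
}


def summarize(split_counts):
    """Return the split total and each label's share in percent."""
    total = sum(split_counts.values())
    shares = {label: round(100 * n / total, 2) for label, n in split_counts.items()}
    return total, shares


for split, split_counts in counts.items():
    total, shares = summarize(split_counts)
    print(split, total, shares)
```

The totals come to 246,258 training and 64,747 test sentences, with label 0 making up roughly 54% of each split, so the two classes are reasonably balanced.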
93
+ ***
94
+
95
+ ## Training (train)
96
+ ```bash
97
+ python3 modeling/train.py
98
+ ```
99
+
100
+ ***
101
+
102
+ ## Prediction (inference)
103
+ ```bash
104
+ python3 inference.py
105
+ ```
106
+
107
+ ```python
108
+ def formal_percentage(self, text):
109
+     return round(float(self.predict(text)[0][1]), 4)  # P(formal) is class index 1; 4 decimals keep 2 in the percentage
110
+
111
+ def print_message(self, text):
112
+     result = self.formal_percentage(text)
113
+     if result > 0.5:
114
+         print(f'{text} : formal (jondaetmal). ( probability {result * 100:.2f}% )')
115
+     else:
116
+         print(f'{text} : informal (banmal). ( probability {(1 - result) * 100:.2f}% )')
117
+ ```
118
+
119
+ Result
120
+ ```
121
+ 저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : formal (jondaetmal). ( probability 99.19% )
122
+ 저번에 교수님께서 자료 가져오라했는데 기억나? : informal (banmal). ( probability 92.86% )
123
+ ```
124
+
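The snippet above reads the formal probability from `self.predict(text)[0][1]`, but how `predict` produces that value is not shown. A common approach for a two-class classifier is a softmax over the two output logits, sketched below in plain Python. The function names and the example logits are made up for illustration and are not taken from `inference.py`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def formal_probability(logits):
    """Probability of class index 1, assumed here to be the formal class."""
    return softmax(logits)[1]


example_logits = [-1.2, 3.6]  # made-up [informal, formal] logits
p = formal_probability(example_logits)
print(f"formal probability: {p * 100:.2f}%")
```

With two classes the probabilities sum to 1, which is why the snippet above can report the informal probability as `1 - result`.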
125
+
126
+
127
+ ***
128
+
129
+ ## Citation
130
+ ```bash
131
+ @misc{SmilegateAI2022KoreanSmileStyleDataset,
132
+ title = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
133
+ author = {Seonghyun Kim},
134
+ year = {2022},
135
+ howpublished = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
136
+ }
137
+ ```
138
+
139
+ ```bash
140
+ @inproceedings{lee2020kcbert,
141
+ title={KcBERT: Korean Comments BERT},
142
+ author={Lee, Junbum},
143
+ booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
144
+ pages={437--440},
145
+ year={2020}
146
+ }
147
+ ```