---
license: apache-2.0
language:
- ko
pipeline_tag: text2text-generation
---
# Korean Formal Convertor Using Deep Learning
Korean distinguishes between formal speech (존댓말, jondaetmal) and informal speech (반말, banmal). This model is a convertor that rewrites informal speech as formal speech. <br>
*The collected formal-speech datasets contained two styles, "haeyo-che" (해요체) and "hapsyo-che" (합쇼체), but this model converts everything to "haeyo-che" for consistency.
|Hapsyo-che (합쇼체)|*Haeyo-che (해요체)|English gloss|
|------|---|---|
|안녕하십니까.|안녕하세요.|Hello.|
|좋은 아침입니다.|좋은 아침이에요.|Good morning.|
|바쁘시지 않았으면 좋겠습니다.|바쁘시지 않았으면 좋겠어요.|I hope you are not busy.|
## Background
- I previously trained a classifier (https://github.com/jongmin-oh/korean-formal-classifier) that distinguishes formal from informal speech.<br>
The plan was to use the classifier to split data by speech style, but formal sentences made up a relatively small share of the data. This model was built to convert informal sentences into formal ones and so increase the proportion of formal data.
## Korean Formal Speech Convertor
- The convertor performs a Text2Text generation task on the T5 model architecture to convert informal speech into formal speech.
- To use it right away, download the Hugging Face model ('j5ng/et5-formal-convertor') and follow the example code below.
## Based on PLM (ET5)
- ETRI(https://aiopen.etri.re.kr/et5Model)
## Based on Datasets
- AI Hub (https://www.aihub.or.kr/): Korean Speech Style Conversion Corpus
  1. KETI everyday office conversations, 1,254 sentences
  2. Manually tagged parallel data
- Smilegate speech style dataset (Korean SmileStyle Dataset)
### Preprocessing
1. Split the data into informal and formal sentences (keeping only "haeyo-che" for the formal side)
   - From the Smilegate data, use only the ['formal','informal'] columns
   - From the manually tagged parallel data, use only the ["*.ban", "*.yo"] txt files
   - From the KETI everyday office data, use only the ["반말","해요체"] columns
2. Merge the three datasets
3. Remove periods (.) and commas (,)
4. Remove duplicates in the informal column: 1,632 duplicate rows removed
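Steps 2 to 4 above can be sketched in plain Python. This is a minimal illustration and not the author's actual preprocessing script: the names `strip_punctuation` and `build_training_pairs`, the dict layout with `informal`/`formal` keys, the trimming of surrounding whitespace, and the choice to keep the first occurrence of a duplicate are all assumptions.

```python
from typing import Dict, Iterable, List

Pair = Dict[str, str]  # assumed layout: {"informal": ..., "formal": ...}


def strip_punctuation(text: str) -> str:
    """Step 3: remove periods and commas (whitespace trimming is an assumption)."""
    return text.replace(".", "").replace(",", "").strip()


def build_training_pairs(*sources: Iterable[Pair]) -> List[Pair]:
    """Merge several sources, clean both sides, and deduplicate on the informal side."""
    merged: List[Pair] = []
    seen_informal = set()
    for source in sources:  # step 2: merge the datasets
        for pair in source:
            informal = strip_punctuation(pair["informal"])
            formal = strip_punctuation(pair["formal"])
            if not informal or not formal:
                continue  # assumption: skip rows that become empty
            if informal in seen_informal:
                continue  # step 4: drop duplicate informal sentences
            seen_informal.add(informal)
            merged.append({"informal": informal, "formal": formal})
    return merged


# Toy rows standing in for two of the datasets
smilegate = [{"informal": "응, 고마워.", "formal": "네, 감사해요."}]
keti = [
    {"informal": "응 고마워", "formal": "네 고마워요"},  # duplicate after cleaning
    {"informal": "괜찮겠어?", "formal": "괜찮으실까요?"},
]
pairs = build_training_pairs(smilegate, keti)
print(len(pairs))  # 2
```

Deduplicating after punctuation removal follows the order of the steps listed above, so sentences that differ only by a period or comma count as duplicates.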
### Final training data examples
|informal|formal|English gloss|
|------|---|---|
|응 고마워|네 감사해요|Yes, thank you|
|나도 그 책 읽었어 굉장히 웃긴 책이였어|저도 그 책 읽었습니다 굉장히 웃긴 책이였어요|I read that book too; it was a really funny book|
|미세먼지가 많은 날이야|미세먼지가 많은 날이네요|It is a day with a lot of fine dust|
|괜찮겠어?|괜찮으실까요?|Will you be okay?|
|아니야 회의가 잠시 뒤에 있어 준비해줘|아니에요 회의가 잠시 뒤에 있어요 준비해주세요|No, there is a meeting shortly; please get ready|
#### Total: 14,992 pairs
***
## How to use
```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the T5 model and tokenizer
model = T5ForConditionalGeneration.from_pretrained("j5ng/et5-formal-convertor")
tokenizer = T5Tokenizer.from_pretrained("j5ng/et5-formal-convertor")

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# device = "mps" if torch.backends.mps.is_available() else "cpu"  # for Apple Silicon (M1) Macs

model = model.to(device)

# Example input sentence (informal)
input_text = "나 진짜 화났어 지금"

# Encode the input; the prefix means "Please convert to formal speech: "
input_encoding = tokenizer("존댓말로 바꿔주세요: " + input_text, return_tensors="pt")

input_ids = input_encoding.input_ids.to(device)
attention_mask = input_encoding.attention_mask.to(device)

# Generate the output with beam search
output_encoding = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=128,
    num_beams=5,
    early_stopping=True,
)

# Decode the output sentence
output_text = tokenizer.decode(output_encoding[0], skip_special_tokens=True)

# Print the result
print(output_text)  # 저 진짜 화났습니다 지금.
```
***
## With Transformers Pipeline
```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer, pipeline

model = T5ForConditionalGeneration.from_pretrained('j5ng/et5-formal-convertor')
tokenizer = T5Tokenizer.from_pretrained('j5ng/et5-formal-convertor')

# Build a text2text-generation pipeline on the GPU if one is available
formal_convertor = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    framework="pt",
)

# Example input sentence (informal)
input_text = "널 가질 수 있을거라 생각했어"
output_text = formal_convertor("존댓말로 바꿔주세요: " + input_text,
            max_length=128,
            num_beams=5,
            early_stopping=True)[0]['generated_text']

print(output_text)  # 당신을 가질 수 있을거라 생각했습니다.
```
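
When converting many sentences at once, for example to augment a corpus, the prefixing and batching can be kept separate from the model call. The following is a minimal sketch and not part of this repository: `build_prompts`, `convert_batch`, and `batch_size` are illustrative names, and `generate` is a hypothetical callable that maps a list of prompts to a list of output strings.

```python
from typing import Callable, Iterable, List

# Task prefix used in the examples above ("Please convert to formal speech: ")
PREFIX = "존댓말로 바꿔주세요: "


def build_prompts(sentences: Iterable[str]) -> List[str]:
    """Prepend the task prefix to each non-empty sentence."""
    return [PREFIX + s.strip() for s in sentences if s.strip()]


def convert_batch(
    sentences: Iterable[str],
    generate: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Run `generate` over the prompts in fixed-size chunks and collect the outputs."""
    prompts = build_prompts(sentences)
    outputs: List[str] = []
    for start in range(0, len(prompts), batch_size):
        outputs.extend(generate(prompts[start:start + batch_size]))
    return outputs


# Stand-in generator for demonstration only; it echoes the prompt without the prefix.
def echo_generate(batch: List[str]) -> List[str]:
    return [prompt[len(PREFIX):] for prompt in batch]


print(convert_batch(["응 고마워", "괜찮겠어?"], echo_generate, batch_size=1))
```

With the pipeline object from the example above, `generate` could wrap a call that passes the list of prompts to the pipeline and reads `generated_text` from each result; the exact return shape should be checked against the installed `transformers` version.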
## Thanks to
The formal speech convertor was trained with GPU resources provided by the Artificial Intelligence Industry Cluster Agency (AICA).