# Bad_text_classifier

## Model Introduction

This model classifies whether comments and chat messages found on the internet contain sensitive content. It was fine-tuned on a dataset built by correcting the labels of public datasets and merging them. Please understand that the model cannot judge every sentence correctly.

```
NOTE)
Because of copyright restrictions on the public datasets, the modified data used to train the model cannot be released.
Also, please note that the model's outputs are unrelated to my own opinions.
```

## Dataset

### data label

* **0 : bad sentence**
* **1 : not bad sentence**

### Datasets used

* [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
* [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)

### Dataset processing method

Neither dataset was originally labeled for binary classification, so both were relabeled into binary form. Only the label 1 (not bad sentence) examples from the Korean HateSpeech Dataset were then selected and merged into the processed Korean Unsmile Dataset.
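
The relabel-and-merge step described above could look roughly like the sketch below. The field names (`text`, `clean`, `hate`) and both mapping rules are assumptions made for illustration; the exact rules used for this model are not published, and the toy rows stand in for the real files.

```python
# Toy rows standing in for the two datasets; the field names are assumptions.
unsmile_rows = [
    {"text": "sentence A", "clean": 1},
    {"text": "sentence B", "clean": 0},
]
hate_speech_rows = [
    {"text": "comment C", "hate": "none"},
    {"text": "comment D", "hate": "offensive"},
]


def unsmile_to_binary(row):
    # Assumed rule: clean rows become 1 (not bad sentence), all others 0 (bad sentence).
    return {"text": row["text"], "label": 1 if row["clean"] == 1 else 0}


def hate_speech_to_binary(row):
    # Assumed rule: only rows annotated "none" count as 1 (not bad sentence).
    return {"text": row["text"], "label": 1 if row["hate"] == "none" else 0}


def build_dataset(unsmile, hate_speech):
    unsmile_binary = [unsmile_to_binary(r) for r in unsmile]
    hate_binary = [hate_speech_to_binary(r) for r in hate_speech]
    # Keep only label 1 rows from the HateSpeech data, then append them to Unsmile.
    return unsmile_binary + [r for r in hate_binary if r["label"] == 1]


merged = build_dataset(unsmile_rows, hate_speech_rows)
```

Here `merged` holds every relabeled Unsmile row plus only the "not bad sentence" rows from the HateSpeech data.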

</br>

**Some examples labeled clean in the Korean Unsmile Dataset were changed to 0 (bad sentence).**

* Among sentences containing "~노", those that also contain "이기" or "노무" were changed to 0 (bad sentence)
* Examples containing words with sexual connotations, such as "좆" or "봊", were changed to 0 (bad sentence)

</br>
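
The two correction rules above can be expressed as a small keyword filter. This is only a sketch: it treats "~노" as a plain substring match on "노" and checks only the example terms listed, both of which are assumptions, and the function name `relabel_clean` is hypothetical.

```python
SLANG_MARKERS = ("이기", "노무")  # terms named in the first rule
SEXUAL_TERMS = ("좆", "봊")       # example terms named in the second rule


def relabel_clean(sentence, label):
    """Return the corrected label for one Korean Unsmile row."""
    if label != 1:
        return label  # only rows originally labeled clean are revisited
    # Rule 1: the sentence contains "노" together with one of the slang markers.
    if "노" in sentence and any(term in sentence for term in SLANG_MARKERS):
        return 0
    # Rule 2: the sentence contains a term with sexual connotations.
    if any(term in sentence for term in SEXUAL_TERMS):
        return 0
    return label
```

A real pass over the dataset would likely need manual review as well, since substring matching can flag harmless sentences.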

## Model Training

* Fine-tuning was performed with `ElectraForSequenceClassification` from huggingface transformers.
* Three publicly available Korean ELECTRA models were each fine-tuned separately.

### Models used

* [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
* [monologg/koELECTRA](https://github.com/monologg/KoELECTRA)
* [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)
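
A minimal sketch of such a fine-tuning setup follows, using the hyperparameters reported in the note under the accuracy table. The choice of `torch.optim.AdamW`, the helper names, and the default checkpoint are assumptions; the original training script is not shown here, and fp16 training is omitted for brevity.

```python
# Hyperparameters reported in this README.
HPARAMS = {"learning_rate": 3e-6, "weight_decay": 0.001, "batch_size": 128}


def build_model_and_optimizer(checkpoint="tunib/electra-ko-base"):
    # Imported lazily so that reading HPARAMS does not require torch/transformers.
    import torch
    from transformers import ElectraForSequenceClassification

    # num_labels=2 puts a fresh binary classification head on top of the encoder.
    model = ElectraForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    # AdamW is an assumed optimizer choice, not confirmed by the source.
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=HPARAMS["learning_rate"],
        weight_decay=HPARAMS["weight_decay"],
    )
    return model, optimizer


def train_step(model, optimizer, batch):
    """Run one optimization step; `batch` holds input_ids, attention_mask and labels tensors."""
    model.train()
    loss = model(**batch).loss  # cross-entropy over the two classes
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Calling `build_model_and_optimizer()` downloads the checkpoint, so it needs network access on first use.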

## How to use model?

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model = AutoModelForSequenceClassification.from_pretrained('JminJ/koElectra_base_Bad_Sentence_Classifier')
tokenizer = AutoTokenizer.from_pretrained('JminJ/koElectra_base_Bad_Sentence_Classifier')
```
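
The snippet above only loads the model. A possible way to run a prediction with it is sketched below. It assumes the model's output indices follow the label ids from the "data label" section (0 = bad sentence, 1 = not bad sentence); the helper names `logits_to_labels` and `classify` are hypothetical.

```python
# Assumed mapping from output index to label, taken from the "data label" section.
LABELS = {0: "bad sentence", 1: "not bad sentence"}


def logits_to_labels(logit_rows):
    """Map each row of class scores to the label with the highest score."""
    return [LABELS[max(range(len(row)), key=row.__getitem__)] for row in logit_rows]


def classify(texts, model, tokenizer):
    """Return one label string per input text."""
    import torch  # imported lazily so logits_to_labels stays dependency-free

    model.eval()  # disable dropout for inference
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits_to_labels(logits.tolist())
```

With `model` and `tokenizer` loaded as shown earlier, `classify(["오늘 날씨가 참 좋네요."], model, tokenizer)` returns a list with one label per sentence.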

## Model Valid Accuracy

| model | accuracy |
| ---------- | ---------- |
| kcElectra_base_fp16_wd_custom_dataset | 0.8849 |
| tunibElectra_base_fp16_wd_custom_dataset | 0.8726 |
| koElectra_base_fp16_wd_custom_dataset | 0.8434 |

```
Note)
All models were trained with the same seed, learning_rate (3e-06), weight_decay lambda (0.001), and batch_size (128).
```
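
For reference, validation accuracy is the fraction of validation sentences whose predicted label matches the gold label. The helper below is a generic sketch of that computation, not the evaluation script used for the table.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that equal the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```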

## Contact

* jminju254@gmail.com

</br></br>

## Github

* https://github.com/JminJ/Bad_text_classifier

</br></br>

## Reference

* [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
* [monologg/koELECTRA](https://github.com/monologg/KoELECTRA)
* [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)
* [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
* [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)
* [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)