---
language: tr
tags:
- bert
- turkish
- text-classification
- offensive-language-detection
license: mit
datasets:
- offenseval2020_tr
metrics:
- accuracy
- f1
- precision
- recall
---
# Offensive Language Detection for Turkish
## Model Description
This model was fine-tuned from [dbmdz/bert-base-turkish-128k-uncased](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased) on the [OffensEval 2020](https://huggingface.co/datasets/offenseval2020_tr) Turkish dataset.
The offenseval-tr training set contains 31,756 annotated tweets.
## Dataset Distribution
|       | Non-Offensive (0) | Offensive (1) |
|-------|-------------------|---------------|
| Train | 25,625            | 6,131         |
| Test  | 2,812             | 716           |
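As a quick check, the split sizes above can be reproduced with the `datasets` library (a sketch, assuming the dataset id from the link above still resolves on the Hugging Face Hub):
```python
# Sketch: load the OffensEval 2020 Turkish dataset and verify the split sizes
# shown in the table above (train: 31,756 tweets, test: 3,528 tweets)
from datasets import load_dataset

ds = load_dataset("offenseval2020_tr")
print({split: ds[split].num_rows for split in ds})
```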
## Preprocessing Steps
| Process | Description |
|--------------------------------------------------|---------------------------------------------------|
| Accented character transformation | Converting accented characters to their unaccented equivalents |
| Lowercase transformation | Converting all text to lowercase |
| Removing @user mentions | Removing @user formatted user mentions from text |
| Removing hashtag expressions | Removing #hashtag formatted expressions from text |
| Removing URLs | Removing URLs from text |
| Removing punctuation and punctuated emojis | Removing punctuation marks and emojis presented with punctuation from text |
| Removing emojis | Removing emojis from text |
| Deasciification | Converting ASCII text into text containing Turkish characters |
The effect of each preprocessing step on model performance was analyzed individually.
Removing digits and keeping hashtags had no effect.
## Usage
Install the necessary libraries:
```bash
pip install git+https://github.com/emres/turkish-deasciifier.git
pip install keras_preprocessing
```
The pre-processing functions are given below:
```python
from turkish.deasciifier import Deasciifier

def deasciifier(text):
    # Convert ASCII-only text back to proper Turkish characters
    deasciifier = Deasciifier(text)
    return deasciifier.convert_to_turkish()

def remove_circumflex(text):
    # Map circumflexed (accented) characters to their plain equivalents
    circumflex_map = {
        'â': 'a',
        'î': 'i',
        'û': 'u',
        'ô': 'o',
        'Â': 'A',
        'Î': 'I',
        'Û': 'U',
        'Ô': 'O'
    }
    return ''.join(circumflex_map.get(c, c) for c in text)

def turkish_lower(text):
    # Lowercase with Turkish-specific casing rules (e.g. 'I' -> 'ı', 'İ' -> 'i')
    turkish_map = {
        'I': 'ı',
        'İ': 'i',
        'Ç': 'ç',
        'Ş': 'ş',
        'Ğ': 'ğ',
        'Ü': 'ü',
        'Ö': 'ö'
    }
    return ''.join(turkish_map.get(c, c).lower() for c in text)
```
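For example, a quick sanity check of these helpers (illustrative inputs, not from the original card):
```python
# Illustrative sanity check of the helper functions above
print(remove_circumflex("kâğıt"))    # -> kağıt
print(turkish_lower("İSTANBUL'DA"))  # -> istanbul'da
```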
Clean the text using the function below:
```python
import re

def clean_text(text):
    # Remove circumflexed characters
    text = remove_circumflex(text)
    # Convert text to lowercase (Turkish-aware)
    text = turkish_lower(text)
    # Deasciify
    text = deasciifier(text)
    # Remove @user mentions
    text = re.sub(r"@\S*", " ", text)
    # Remove hashtags
    text = re.sub(r'#\S+', ' ', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
    # Remove punctuation and text-based emoticons
    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
    # Remove emojis
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)
    # Collapse multiple whitespace characters into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text
```
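As an illustration (a hypothetical tweet, not from the dataset; the exact output can vary with the deasciifier's heuristics):
```python
# Hypothetical example; deasciification may adjust some words
print(clean_text("@USER Harika bir gün! 😊 http://example.com #güzel"))
# e.g. -> "harika bir gün"
```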
## Model Initialization
```python
# Load model directly
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")

# Run on GPU when available; `device` is also used by is_offensive below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()
```
Check whether a sentence is offensive as follows:
```python
import numpy as np
import torch

def is_offensive(sentence):
    d = {
        0: 'non-offensive',
        1: 'offensive'
    }
    normalize_text = clean_text(sentence)
    test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')
    test_sample = {k: v.to(device) for k, v in test_sample.items()}
    with torch.no_grad():
        output = model(**test_sample)
    y_pred = np.argmax(output.logits.detach().cpu().numpy(), axis=1)
    print(normalize_text, "-->", d[y_pred[0]])
    return y_pred[0]
```
```python
is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")
```
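For quick experiments, the same checkpoint can also be called through the `transformers` pipeline API. Note that this sketch skips the custom `clean_text` preprocessing used during training, so predictions on raw tweets may differ:
```python
from transformers import pipeline

# Sketch: pipeline inference without the tweet-specific preprocessing above
classifier = pipeline("text-classification", model="TURKCELL/bert-offensive-lang-detection-tr")
print(classifier("iyi günler dilerim"))
```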
## Evaluation
Evaluation results on the test set are shown in the table below.
We achieve 89% accuracy on the test set.
## Model Performance Metrics
| Class     | Precision | Recall | F1-score |
|-----------|-----------|--------|----------|
| Class 0   | 0.92      | 0.94   | 0.93     |
| Class 1   | 0.73      | 0.67   | 0.70     |
| Macro avg | 0.83      | 0.80   | 0.81     |

Overall accuracy: 0.89
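These numbers could be reproduced along the lines of the sketch below, where `test_texts` and `test_labels` are hypothetical variables holding the test split:
```python
# Sketch: recompute the report above; `test_texts` and `test_labels` are
# hypothetical holders for the 3,528-tweet test split and its 0/1 labels
from sklearn.metrics import classification_report

y_pred = [is_offensive(t) for t in test_texts]
print(classification_report(test_labels, y_pred, digits=2))
```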