zeynepgulhan committed 9044a3b (1 parent: be74de9)

update readme

Files changed (1): README.md (+161 -0)

---
license: mit
---

# Offensive Language Detection for the Turkish Language

## Model Description

This model is a fine-tuned version of [dbmdz/bert-base-turkish-128k-uncased](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased), trained on the [OffensEval 2020](https://huggingface.co/datasets/offenseval2020_tr) dataset.
The offenseval-tr dataset contains 31,756 annotated tweets.
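
Below is a minimal, illustrative sketch of this kind of fine-tuning setup with the Hugging Face `Trainer`. It is not the exact recipe used for this model: the column names (`tweet`, `subtask_a`) and the hyperparameters shown are assumptions, and the custom preprocessing described below is omitted for brevity.

```python
# Illustrative fine-tuning sketch, not the exact recipe used for this model.
# Assumes the dataset exposes "tweet" (text) and "subtask_a" (0/1 label) columns;
# newer versions of `datasets` may also require trust_remote_code=True.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("offenseval2020_tr")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-128k-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["tweet"], truncation=True, max_length=256)

encoded = dataset.map(tokenize, batched=True).rename_column("subtask_a", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="offensive-tr", num_train_epochs=3),  # illustrative values
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```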

## Dataset Distribution

|       | Non-Offensive (0) | Offensive (1) |
|-------|-------------------|---------------|
| Train | 25,625            | 6,131         |
| Test  | 2,812             | 716           |

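A quick way to sanity-check these counts, assuming the dataset's default configuration and a `subtask_a` label column (these names are assumptions, not taken from this card):

```python
# Count labels per split; the config/column names are assumptions and may need adjusting.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("offenseval2020_tr")
for split in ("train", "test"):
    print(split, Counter(ds[split]["subtask_a"]))
```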

## Preprocessing Steps

| Process | Description |
|---------|-------------|
| Accented character transformation | Convert accented characters to their unaccented equivalents |
| Lowercase transformation | Convert all text to lowercase (Turkish-aware) |
| Removing @user mentions | Remove @user-style mentions from the text |
| Removing hashtag expressions | Remove #hashtag-style expressions from the text |
| Removing URLs | Remove URLs from the text |
| Removing punctuation and punctuated emojis | Remove punctuation marks and punctuation-based emoticons (e.g. `:)`, `:D`) from the text |
| Removing emojis | Remove Unicode emojis from the text |
| Deasciification | Convert ASCII-only text into text with the proper Turkish characters |

The contribution of each preprocessing step was evaluated individually; removing digits and keeping hashtags had no effect on performance.


## Usage

Install the necessary libraries:

```bash
pip install git+https://github.com/emres/turkish-deasciifier.git
pip install keras_preprocessing
```

The preprocessing functions are given below:

```python
from turkish.deasciifier import Deasciifier


def deasciifier(text):
    # Convert ASCII-only text to text with the proper Turkish characters
    deasciifier = Deasciifier(text)
    return deasciifier.convert_to_turkish()


def remove_circumflex(text):
    # Replace circumflex-accented characters with their plain equivalents
    circumflex_map = {
        'â': 'a',
        'î': 'i',
        'û': 'u',
        'ô': 'o',
        'Â': 'A',
        'Î': 'I',
        'Û': 'U',
        'Ô': 'O'
    }
    return ''.join(circumflex_map.get(c, c) for c in text)


def turkish_lower(text):
    # Lowercase with Turkish-specific mappings (e.g. 'I' -> 'ı', 'İ' -> 'i')
    turkish_map = {
        'I': 'ı',
        'İ': 'i',
        'Ç': 'ç',
        'Ş': 'ş',
        'Ğ': 'ğ',
        'Ü': 'ü',
        'Ö': 'ö'
    }
    return ''.join(turkish_map.get(c, c).lower() for c in text)
```
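
A quick, illustrative check of these helpers (the outputs in the comments are indicative; the deasciifier's output in particular may vary):

```python
print(remove_circumflex("kâğıt"))    # -> "kağıt"
print(turkish_lower("İYİ GÜNLER"))   # -> "iyi günler"
print(deasciifier("guzel bir gun"))  # -> "güzel bir gün" (typically)
```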

Clean the text with the function below:

```python
import re


def clean_text(text):
    # Remove circumflex (accented) characters
    text = remove_circumflex(text)
    # Convert to lowercase (Turkish-aware)
    text = turkish_lower(text)
    # Deasciify
    text = deasciifier(text)
    # Remove @user mentions
    text = re.sub(r"@\S*", " ", text)
    # Remove hashtags
    text = re.sub(r'#\S+', ' ', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
    # Remove punctuation and punctuation-based emoticons
    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
    # Remove emojis
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)

    # Collapse multiple whitespace characters into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text
```
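
For example (illustrative; the exact output depends on the deasciifier):

```python
print(clean_text("@user Harika bir gün!!! 😀 #mutlu http://example.com"))
# -> roughly "harika bir gün"
```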

## Model Initialization

```python
# Load the model directly
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")

# Select a device and move the model to it (used by is_offensive below)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```
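
Alternatively, a quick sketch using the high-level `pipeline` API (note that this skips the custom `clean_text` preprocessing described above, and the returned label names depend on the model's configuration):

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="TURKCELL/bert-offensive-lang-detection-tr")
print(classifier("örnek bir cümle"))  # -> [{'label': ..., 'score': ...}]
```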

Check whether a sentence is offensive as follows:

```python
import numpy as np


def is_offensive(sentence):
    d = {
        0: 'non-offensive',
        1: 'offensive'
    }
    # Apply the preprocessing described above
    normalize_text = clean_text(sentence)
    test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')

    # Move the input tensors to the model's device
    test_sample = {k: v.to(device) for k, v in test_sample.items()}

    output = model(**test_sample)
    y_pred = np.argmax(output.logits.detach().cpu().numpy(), axis=1)

    print(normalize_text, "-->", d[y_pred[0]])
    return y_pred[0]
```

```python
is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")
```

## Evaluation

Evaluation results on the test set are shown in the table below.
The model achieves 89% accuracy on the test set.

## Model Performance Metrics

| Class   | Precision | Recall | F1-score | Accuracy |
|---------|-----------|--------|----------|----------|
| Class 0 | 0.92      | 0.94   | 0.93     | 0.89     |
| Class 1 | 0.73      | 0.67   | 0.70     |          |
| Macro   | 0.83      | 0.80   | 0.81     |          |
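
Per-class metrics like these can be computed with scikit-learn's `classification_report`, for example over a list of (text, label) test pairs (the pairs below are placeholders, not data from this card):

```python
from sklearn.metrics import classification_report

test_pairs = [("örnek bir cümle", 0), ("başka bir örnek", 1)]  # placeholder pairs
y_true = [label for _, label in test_pairs]
y_pred = [int(is_offensive(text)) for text, _ in test_pairs]
print(classification_report(y_true, y_pred, digits=2))
```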