---
language: tr
tags:
- bert
- turkish
- text-classification
- offensive-language-detection
license: mit
datasets:
- offenseval2020_tr
metrics:
- accuracy
- f1
- precision
- recall
---

# Offensive Language Detection For Turkish Language

## Model Description

This model was fine-tuned from [dbmdz/bert-base-turkish-128k-uncased](https://huggingface.co/dbmdz/bert-base-turkish-128k-uncased) on the [OffensEval 2020](https://huggingface.co/datasets/offenseval2020_tr) dataset.

The training split of the offenseval-tr dataset contains 31,756 annotated tweets; the test split contains 3,528.

## Dataset Distribution

|       | Non Offensive (0) | Offensive (1) |
|-------|-------------------|---------------|
| Train | 25625             | 6131          |
| Test  | 2812              | 716           |

## Preprocessing Steps

| Process                                    | Description                                                             |
|--------------------------------------------|-------------------------------------------------------------------------|
| Accented character transformation          | Converting accented characters to their unaccented equivalents          |
| Lowercase transformation                   | Converting all text to lowercase                                        |
| Removing @user mentions                    | Removing @user formatted user mentions from text                        |
| Removing hashtag expressions               | Removing #hashtag formatted expressions from text                       |
| Removing URLs                              | Removing URLs from text                                                 |
| Removing punctuation and punctuated emojis | Removing punctuation marks and emoticons written with punctuation, such as `:)` |
| Removing emojis                            | Removing emojis from text                                               |
| Deasciification                            | Converting ASCII text into text containing Turkish characters           |

The effect of each preprocessing step on performance was analyzed. Removing digits and keeping hashtags had no effect on performance.

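The lowercase step listed above is worth a short illustration, because Python's built-in `str.lower()` follows language-neutral Unicode rules that differ from Turkish casing for `I` and `İ`. The snippet below is only a sketch of that difference; the sample word is illustrative and not taken from the dataset.

```python
# Python's default lowercasing is language-neutral, so it does not
# apply the Turkish rules I -> ı and İ -> i.
word = "IŞIK"  # illustrative sample, not from the dataset

default_lower = word.lower()
print(default_lower)  # dotless capital I becomes dotted 'i'

# Dotted capital İ lowercases to 'i' followed by U+0307 (combining dot above)
dotted = "İ".lower()
print([hex(ord(c)) for c in dotted])

# Mapping the two letters explicitly before lowercasing gives the Turkish result
turkish_map = {"I": "ı", "İ": "i"}
turkish_lower = "".join(turkish_map.get(c, c) for c in word).lower()
print(turkish_lower)
```

This is why a Turkish-aware mapping is applied before the generic lowercase call.
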
## Usage

Install the necessary libraries:

```pip install git+https://github.com/emres/turkish-deasciifier.git```

```pip install keras_preprocessing```

The preprocessing functions are below:

```python
from turkish.deasciifier import Deasciifier

def deasciifier(text):
    # Restore Turkish characters in ASCII-only text
    converter = Deasciifier(text)
    return converter.convert_to_turkish()

def remove_circumflex(text):
    # Map circumflexed vowels to their plain equivalents
    circumflex_map = {
        'â': 'a', 'î': 'i', 'û': 'u', 'ô': 'o',
        'Â': 'A', 'Î': 'I', 'Û': 'U', 'Ô': 'O'
    }
    return ''.join(circumflex_map.get(c, c) for c in text)

def turkish_lower(text):
    # Lowercase with Turkish rules (I -> ı, İ -> i) before the generic lower()
    turkish_map = {
        'I': 'ı', 'İ': 'i', 'Ç': 'ç', 'Ş': 'ş',
        'Ğ': 'ğ', 'Ü': 'ü', 'Ö': 'ö'
    }
    return ''.join(turkish_map.get(c, c).lower() for c in text)
```

Clean the text using the function below:

```python
import re

def clean_text(text):
    # Replace circumflexed letters with their plain equivalents
    text = remove_circumflex(text)
    # Convert the text to lowercase
    text = turkish_lower(text)
    # Deasciify
    text = deasciifier(text)
    # Remove user mentions
    text = re.sub(r"@\S*", " ", text)
    # Remove hashtags
    text = re.sub(r'#\S+', ' ', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
    # Remove punctuation and text-based emoticons
    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
    # Remove emojis
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)
    # Collapse multiple whitespace characters into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text
```

## Model Initialization

```python
# Load model directly
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()  # move to the selected device and switch to inference mode
```

Check whether a sentence is offensive as shown below:

```python
import numpy as np

def is_offensive(sentence):
    d = {
        0: 'non-offensive',
        1: 'offensive'
    }
    # Apply the same preprocessing that was used for training
    normalize_text = clean_text(sentence)
    test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')
    test_sample = {k: v.to(device) for k, v in test_sample.items()}

    # Gradients are not needed for inference
    with torch.no_grad():
        output = model(**test_sample)
    y_pred = np.argmax(output.logits.detach().cpu().numpy(), axis=1)

    print(normalize_text, "-->", d[y_pred[0]])
    return y_pred[0]
```

```python
is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")
```

## Evaluation

Evaluation results on the test set are shown in the table below. The model achieves 89% accuracy on the test set.

## Model Performance Metrics

| Class   | Precision | Recall | F1-score | Accuracy |
|---------|-----------|--------|----------|----------|
| Class 0 | 0.92      | 0.94   | 0.93     | 0.89     |
| Class 1 | 0.73      | 0.67   | 0.70     |          |
| Macro   | 0.83      | 0.80   | 0.81     |          |
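
As a sanity check on how the reported numbers relate to each other, the sketch below recomputes F1 from the rounded precision and recall values, and derives accuracy from per-class recall and the test-set class counts in the Dataset Distribution table. It works from the rounded figures only, so it is an approximate consistency check rather than a reproduction of the evaluation; the variable and function names are illustrative.

```python
# Rounded per-class precision and recall as reported in the metrics table
reported = {
    "class_0": {"precision": 0.92, "recall": 0.94},
    "class_1": {"precision": 0.73, "recall": 0.67},
}
# Test-set class counts from the Dataset Distribution table
support = {"class_0": 2812, "class_1": 716}

def f1_score(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

f1 = {name: f1_score(m["precision"], m["recall"]) for name, m in reported.items()}
macro_f1 = sum(f1.values()) / len(f1)

# Accuracy = correct predictions / total = sum(recall_c * support_c) / total
total = sum(support.values())
accuracy = sum(reported[c]["recall"] * support[c] for c in support) / total

for name, value in f1.items():
    print(f"{name} F1: {value:.2f}")
print(f"macro F1: {macro_f1:.2f}")
print(f"accuracy: {accuracy:.2f}")
```

The gap between the two classes reflects the class imbalance in the data: the offensive class is the minority in both splits, so overall accuracy is dominated by the non-offensive class.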