Multilingual Sarcasm Detector
Multilingual Sarcasm Detector is a text classification model built to detect sarcasm in news article titles. It is fine-tuned from bert-base-multilingual-uncased, and the training data combines ready-made datasets available on Kaggle with data scraped from multiple newspapers in English, Dutch and Italian.
Labels: 0 -> Not Sarcastic; 1 -> Sarcastic
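When post-processing predictions it can help to turn the class index back into a readable name. A minimal sketch of such a lookup follows; `ID2LABEL` and `label_name` are hypothetical helper names and not part of the model's API, while the two label strings are taken from the mapping above.

```python
# Class index -> label name, as documented in this card.
# ID2LABEL and label_name are illustrative names, not part of the model.
ID2LABEL = {0: "Not Sarcastic", 1: "Sarcastic"}


def label_name(label_id: int) -> str:
    """Return the readable name for a predicted class index (0 or 1)."""
    return ID2LABEL[label_id]


print(label_name(1))  # Sarcastic
```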
Source Data
Datasets:
- English language data: Kaggle: News Headlines Dataset For Sarcasm Detection.
- Dutch non-sarcastic data: Kaggle: Dutch News Articles
Scraped data:
- Dutch sarcastic news from De Speld
- Italian non-sarcastic news from Il Giornale
- Italian sarcastic news from Lercio
Training Dataset
Codebase:
- Git Repo: Official repository
Example of classification
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
import string
def preprocess_data(text: str) -> str:
    # Lowercase, strip ASCII punctuation and trim surrounding whitespace
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()
MODEL_PATH = "helinivan/multilingual-sarcasm-detector"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
text = "CIA Realizes It's Been Using Black Highlighters All These Years."
tokenized_text = tokenizer([preprocess_data(text)], padding=True, truncation=True, max_length=256, return_tensors="pt")
output = model(**tokenized_text)
probs = output.logits.softmax(dim=-1).tolist()[0]
confidence = max(probs)
prediction = probs.index(confidence)
results = {"is_sarcastic": prediction, "confidence": confidence}
Output:
{'is_sarcastic': 1, 'confidence': 0.9374828934669495}
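The pre- and post-processing steps in the example can be checked without downloading the model. The sketch below repeats `preprocess_data` and reimplements the softmax-and-argmax step in plain Python. The `decode` helper and the logit values passed to it are hypothetical and chosen only for illustration; they are not actual model outputs.

```python
import math
import string


def preprocess_data(text: str) -> str:
    # Same normalisation as in the example: lowercase, drop ASCII punctuation, trim.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()


def decode(logits):
    """Turn two raw logits into the result dict used in the example.

    Hypothetical helper: applies a numerically stable softmax, then picks
    the class with the highest probability.
    """
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    confidence = max(probs)
    return {"is_sarcastic": probs.index(confidence), "confidence": confidence}


headline = "CIA Realizes It's Been Using Black Highlighters All These Years."
print(preprocess_data(headline))
# cia realizes its been using black highlighters all these years

# Made-up logits for [not sarcastic, sarcastic]; not real model output.
result = decode([-1.0, 2.0])
print(result["is_sarcastic"])  # 1
```

Note that the preprocessing removes only ASCII punctuation (`string.punctuation`), so typographic quotes or other non-ASCII marks in a headline pass through unchanged.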
Performance
Model-Name | F1 | Precision | Recall | Accuracy |
---|---|---|---|---|
helinivan/english-sarcasm-detector | 92.38 | 92.75 | 92.38 | 92.42 |
helinivan/italian-sarcasm-detector | 88.26 | 87.66 | 89.66 | 88.69 |
helinivan/multilingual-sarcasm-detector | 87.23 | 88.65 | 86.33 | 88.30 |
helinivan/dutch-sarcasm-detector | 83.02 | 84.27 | 82.01 | 86.81 |
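For reference, the four metrics in the table can be computed from predicted and true labels as sketched below. This is a plain-Python illustration that treats label 1 (Sarcastic) as the positive class; the card does not state how its figures were averaged, so this is not necessarily the procedure behind the numbers above. The function name `binary_metrics` and the toy labels are assumptions made for the example.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy with label 1 as the positive class.

    Hypothetical helper for illustration; returns fractions in [0, 1].
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels, invented for this example only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```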