---
language: "multilingual"
tags:
- bert
- sarcasm-detection
- text-classification
widget:
- text: "Auto, stop a diesel e benzina dal 2035. Ecco cosa cambia per i consumatori"
- text: "CIA Realizes It's Been Using Black Highlighters All These Years."
- text: "We deden een man een nacht in een vat met cola en nu is hij dood"
---

# Multilingual Sarcasm Detector

Multilingual Sarcasm Detector is a text classification model built to detect sarcasm in news article titles. It is fine-tuned from [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased), and the training data consists of ready-made datasets available on Kaggle as well as data scraped from multiple newspapers in English, Dutch, and Italian.


<b>Labels</b>:
- 0 -> Not Sarcastic
- 1 -> Sarcastic
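The integer prediction can be mapped back to a human-readable label with a small lookup. The dictionary below is an illustrative helper, not part of the released code:

```python
# Illustrative mapping from the model's integer output to label names
ID2LABEL = {0: "Not Sarcastic", 1: "Sarcastic"}

print(ID2LABEL[1])  # Sarcastic
```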


## Training Data

Datasets:
- English language data: [Kaggle: News Headlines Dataset For Sarcasm Detection](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection)
- Dutch non-sarcastic data: [Kaggle: Dutch News Articles](https://www.kaggle.com/datasets/maxscheijen/dutch-news-articles)

Scraped data:
- Dutch sarcastic news from [De Speld](https://speld.nl)
- Italian non-sarcastic news from [Il Giornale](https://www.ilgiornale.it)
- Italian sarcastic news from [Lercio](https://www.lercio.it)

Codebase:
- Git Repo: [Official repository](https://github.com/helinivan/multilingual-sarcasm-detector)


---

## Example of classification

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import string

def preprocess_data(text: str) -> str:
    # Lowercase the text and strip punctuation before tokenization
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

MODEL_PATH = "helinivan/multilingual-sarcasm-detector"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)

text = "CIA Realizes It's Been Using Black Highlighters All These Years."
tokenized_text = tokenizer([preprocess_data(text)], padding=True, truncation=True, max_length=512, return_tensors="pt")
output = model(**tokenized_text)
probs = output.logits.softmax(dim=-1).tolist()[0]
confidence = max(probs)
prediction = probs.index(confidence)
results = {"is_sarcastic": prediction, "confidence": confidence}
print(results)
```

Output: 

```
{'is_sarcastic': 1, 'confidence': 0.9999909400939941}
```
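The post-processing in the snippet above (softmax over the logits, then taking the most probable class) can be factored into a small helper. A minimal sketch using PyTorch tensors; the function name is illustrative:

```python
import torch

def postprocess(logits: torch.Tensor) -> dict:
    # Turn raw logits into class probabilities, then pick the most likely class
    probs = logits.softmax(dim=-1)
    confidence, prediction = probs.max(dim=-1)
    return {"is_sarcastic": int(prediction), "confidence": float(confidence)}

# Dummy logits standing in for model(**tokenized_text).logits
print(postprocess(torch.tensor([[-2.0, 3.0]])))
```

This keeps the model call and the decision rule separate, which makes the thresholding logic easy to unit-test without downloading the model.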

## Performance
| Model-Name | F1 | Precision | Recall | Accuracy |
| ------------- | ------ | --------- | ------ | -------- |
| [helinivan/english-sarcasm-detector](https://huggingface.co/helinivan/english-sarcasm-detector) | 92.38 | 92.75 | 92.38 | 92.42 |
| [helinivan/italian-sarcasm-detector](https://huggingface.co/helinivan/italian-sarcasm-detector) | 88.26 | 87.66 | 89.66 | 88.69 |
| [helinivan/multilingual-sarcasm-detector](https://huggingface.co/helinivan/multilingual-sarcasm-detector) | **87.23** | 88.65 | 86.33 | 88.30 |
| [helinivan/dutch-sarcasm-detector](https://huggingface.co/helinivan/dutch-sarcasm-detector) | 83.02 | 84.27 | 82.01 | 86.81 |
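The scores above are percentages. Metrics of this kind can be computed with scikit-learn; the toy labels and the averaging mode below are assumptions for illustration, as the card does not state how the scores were aggregated:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only; a real evaluation uses the held-out test split
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(round(accuracy_score(y_true, y_pred) * 100, 2))                    # accuracy as a percentage
print(round(f1_score(y_true, y_pred, average="macro") * 100, 2))         # macro-averaged F1
```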