---
language: "multilingual"
tags:
- bert
- sarcasm-detection
- text-classification
widget:
- text: "Gli Usa a un passo dalla recessione"
- text: "CIA Realizes It's Been Using Black Highlighters All These Years."
- text: "We deden een man een nacht in een vat met cola en nu is hij dood"
---

# Multilingual Sarcasm Detector

Multilingual Sarcasm Detector is a text classification model built to detect sarcasm in news article titles. It is fine-tuned from [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased), and its training data consists of ready-made datasets available on Kaggle as well as data scraped from several newspapers in English, Dutch, and Italian.


<b>Labels</b>:
- 0 -> Not Sarcastic
- 1 -> Sarcastic
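The integer predictions can be mapped back to these human-readable labels; a minimal sketch (the `ID2LABEL` dict here is written out for illustration rather than read from the model config):

```python
# Illustrative mapping from class index to label name
ID2LABEL = {0: "Not Sarcastic", 1: "Sarcastic"}

def label_for(prediction: int) -> str:
    """Translate a predicted class index into its label string."""
    return ID2LABEL[prediction]
```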


## Training Data
Datasets:
- English language data: [Kaggle: News Headlines Dataset For Sarcasm Detection](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection).
- Dutch non-sarcastic data: [Kaggle: Dutch News Articles](https://www.kaggle.com/datasets/maxscheijen/dutch-news-articles)

Scraped data:
- Dutch sarcastic news from [De Speld](https://speld.nl)
- Italian non-sarcastic news from [Il Giornale](https://www.ilgiornale.it)
- Italian sarcastic news from [Lercio](https://www.lercio.it)

Codebase:
- Git Repo: [Official repository](https://github.com/helinivan/multilingual-sarcasm-detector)


---

## Example of classification

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import string

def preprocess_data(text: str) -> str:
    # Lowercase, strip punctuation, and trim surrounding whitespace
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

MODEL_PATH = "helinivan/multilingual-sarcasm-detector"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)

text = "CIA Realizes It's Been Using Black Highlighters All These Years."
tokenized_text = tokenizer([preprocess_data(text)], padding=True, truncation=True, max_length=256, return_tensors="pt")
output = model(**tokenized_text)
probs = output.logits.softmax(dim=-1).tolist()[0]
confidence = max(probs)
prediction = probs.index(confidence)
results = {"is_sarcastic": prediction, "confidence": confidence}
print(results)
```

Output: 

```
{'is_sarcastic': 1, 'confidence': 0.9374828934669495}
```

## Performance
| Model Name | F1 | Precision | Recall | Accuracy |
| ------------- | ----- | ----- | ----- | ----- |
| [helinivan/english-sarcasm-detector](https://huggingface.co/helinivan/english-sarcasm-detector) | 92.38 | 92.75 | 92.38 | 92.42 |
| [helinivan/italian-sarcasm-detector](https://huggingface.co/helinivan/italian-sarcasm-detector) | 88.26 | 87.66 | 89.66 | 88.69 |
| [helinivan/multilingual-sarcasm-detector](https://huggingface.co/helinivan/multilingual-sarcasm-detector) | **87.23** | 88.65 | 86.33 | 88.30 |
| [helinivan/dutch-sarcasm-detector](https://huggingface.co/helinivan/dutch-sarcasm-detector) | 83.02 | 84.27 | 82.01 | 86.81 |