---
license: cc
language:
- pt
tags:
- Hate Speech
- kNOwHATE
- not-for-all-audiences
widget:
- text: >-
    as pessoas tem que perceber que ser 'panasca' não é deixar de ser homem, é
    deixar de ser humano 😂😂
pipeline_tag: text-classification
datasets:
- knowhate/youtube-test
---
<img align="left" width="140" height="140" src="https://ilga-portugal.pt/files/uploads/2023/06/logo_HATE_cores_page-0001-1024x539.jpg">
<p style="text-align: center;">&nbsp;&nbsp;&nbsp;&nbsp;This is the model card for HateBERTimbau-YouTube. 
  You may be interested in some of the other models from the <a href="https://huggingface.co/knowhate">kNOwHATE project</a>.
</p>

---

# HateBERTimbau-YouTube

**HateBERTimbau-YouTube** is a transformer-based encoder model for identifying Hate Speech in Portuguese social media text. It is a fine-tuned version of the [HateBERTimbau](https://huggingface.co/knowhate/HateBERTimbau) model, further trained on a dataset of 23,912 YouTube comments focused on Hate Speech.

## Model Description

- **Developed by:** [kNOwHATE: kNOwing online HATE speech: knowledge + awareness = TacklingHate](https://knowhate.eu)
- **Funded by:** [European Union](https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/cerv-2021-equal)
- **Model type:** Transformer-based text classification model fine-tuned for Hate Speech detection in Portuguese social media text
- **Language:** Portuguese
- **Fine-tuned from model:** [knowhate/HateBERTimbau](https://huggingface.co/knowhate/HateBERTimbau)

# Uses

You can use this model directly with a pipeline for text classification:

```python
from transformers import pipeline
classifier = pipeline('text-classification', model='knowhate/HateBERTimbau-youtube')

classifier("as pessoas tem que perceber que ser 'panasca' não é deixar de ser homem, é deixar de ser humano 😂😂")

# [{'label': 'Hate Speech', 'score': 0.9228119850158691}]

```

Alternatively, the model can be fine-tuned for a specific task or dataset:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau-youtube")
model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau-youtube")
dataset = load_dataset("knowhate/youtube-train")

# Adjust the column names below to match the dataset's actual schema.
def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(output_dir="hatebertimbau", evaluation_strategy="epoch")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],  # assumes the dataset provides a validation split
)

trainer.train()

```

# Training

## Data

23,912 YouTube comments associated with offensive content were used to fine-tune the base model.

## Training Hyperparameters

- Batch Size: 32
- Epochs: 3
- Learning Rate: 2e-5 with Adam optimizer
- Maximum Sequence Length: 350 tokens
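
For reference, these settings map onto a `Trainer` configuration roughly as sketched below. This is illustrative, not the published training script; the optimizer shown is the `transformers` default, an AdamW variant of Adam.

```python
# Illustrative mapping of the reported hyperparameters onto TrainingArguments;
# not the published training script.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hatebertimbau",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,  # Trainer's default optimizer is AdamW, a variant of Adam
    evaluation_strategy="epoch",
)

# The 350-token maximum is applied at tokenization time, e.g.:
# tokenizer(..., padding="max_length", truncation=True, max_length=350)
```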

# Testing

## Data

The model was tested on the [knowhate/youtube-test](https://huggingface.co/datasets/knowhate/youtube-test) dataset.

## Results

| Dataset                       | Precision  | Recall    | F1-score     |
|:------------------------------|:-----------|:----------|:-------------|
| **knowhate/youtube-test**     | 0.856      | 0.892     | **0.874**    | 
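
These metrics can be approximated with a minimal evaluation sketch like the one below. It is not the official evaluation script, and it assumes the test split is named `test` and exposes `text` and `label` columns (with 1 meaning Hate Speech); check the dataset card for the actual schema before running.

```python
# Minimal evaluation sketch; column and split names are assumptions.
from datasets import load_dataset
from sklearn.metrics import precision_recall_fscore_support
from transformers import pipeline

classifier = pipeline("text-classification", model="knowhate/HateBERTimbau-youtube")
dataset = load_dataset("knowhate/youtube-test", split="test")

# Binarize predictions against the positive class emitted by the pipeline.
predictions = [out["label"] == "Hate Speech"
               for out in classifier(dataset["text"], truncation=True)]
references = [label == 1 for label in dataset["label"]]

precision, recall, f1, _ = precision_recall_fscore_support(
    references, predictions, average="binary"
)
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
```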

# BibTeX Citation

The associated paper is currently under peer review; the citation will be added here once it is published.

```latex
@article{

}
```

# Acknowledgements

This work was funded in part by the European Union under Grant CERV-2021-EQUAL (101049306).
However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the kNOwHATE project.
Neither the European Union nor the kNOwHATE project can be held responsible for them.