---
license: cc
language:
- pt
tags:
- Hate Speech
- kNOwHATE
- not-for-all-audiences
widget:
- text: >-
as pessoas tem que perceber que ser 'panasca' não é deixar de ser homem, é
deixar de ser humano 😂😂
pipeline_tag: text-classification
datasets:
- knowhate/youtube-test
- knowhate/twitter-test
---
---
<img align="left" width="140" height="140" src="https://ilga-portugal.pt/files/uploads/2023/06/logo_HATE_cores_page-0001-1024x539.jpg">
<p style="text-align: center;"> This is the model card for HateBERTimbau-YouTube-Twitter.
You may be interested in some of the other models from the <a href="https://huggingface.co/knowhate">kNOwHATE project</a>.
</p>
---
# HateBERTimbau-YouTube-Twitter
**HateBERTimbau-YouTube-Twitter** is a transformer-based encoder model for identifying Hate Speech in Portuguese social media text. It is a fine-tuned version of the [HateBERTimbau](https://huggingface.co/knowhate/HateBERTimbau) model, retrained on a dataset of 23,912 YouTube comments and 21,546 tweets, for a total of 45,458 online messages specifically focused on Hate Speech.
## Model Description
- **Developed by:** [kNOwHATE: kNOwing online HATE speech: knowledge + awareness = TacklingHate](https://knowhate.eu)
- **Funded by:** [European Union](https://ec.europa.eu/info/funding-tenders/opportunities/portal/screen/opportunities/topic-details/cerv-2021-equal)
- **Model type:** Transformer-based text classification model fine-tuned for Hate Speech detection in Portuguese social media text
- **Language:** Portuguese
- **Fine-tuned from model:** [knowhate/HateBERTimbau](https://huggingface.co/knowhate/HateBERTimbau)
# Uses
You can use this model directly with a pipeline for text classification:
```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
classifier = pipeline('text-classification', model='knowhate/HateBERTimbau-yt-tt')

classifier("as pessoas tem que perceber que ser 'panasca' não é deixar de ser homem, é deixar de ser humano 😂😂")
# [{'label': 'Hate Speech', 'score': 0.9959186911582947}]
```
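In a moderation setting it can be useful to act only on confident predictions. The following is a minimal sketch rather than part of the original card: `flag_hateful` and the 0.9 threshold are hypothetical choices, and it assumes pipeline results shaped like the output above, with the positive class labelled `'Hate Speech'`.

```python
def flag_hateful(predictions, threshold=0.9, positive_label="Hate Speech"):
    """Return one boolean per prediction: True when the positive label
    is predicted with a score at or above the threshold."""
    return [
        p["label"] == positive_label and p["score"] >= threshold
        for p in predictions
    ]

# Example using the prediction shown above (no model download needed).
sample = [{"label": "Hate Speech", "score": 0.9959186911582947}]
print(flag_hateful(sample))  # [True]
```

The threshold trades precision against recall and would normally be tuned on held-out data for the target platform.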
The model can also be fine-tuned further for a specific task or dataset:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau-yt-tt")
model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau-yt-tt")
dataset = load_dataset("knowhate/youtube-train")

def tokenize_function(examples):
    # Adjust the column names to match your dataset's schema
    return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Newer transformers releases rename this argument to eval_strategy
training_args = TrainingArguments(output_dir="hatebertimbau", evaluation_strategy="epoch")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()
```
# Training
## Data
The base model was fine-tuned on 23,912 YouTube comments and 21,546 tweets, a total of 45,458 online messages associated with offensive content.
## Training Hyperparameters
- Batch Size: 32
- Epochs: 3
- Learning Rate: 2e-5 with Adam optimizer
- Maximum Sequence Length: 350 tokens
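
As a rough sanity check on the training budget, the hyperparameters above can be turned into an optimizer step count. This sketch assumes that all 45,458 messages were used for training with no held-out validation split, no gradient accumulation, and a single device, none of which the card confirms; the variable names are illustrative.

```python
import math

# Values taken from this model card.
num_messages = 23_912 + 21_546   # YouTube comments + tweets
batch_size = 32
epochs = 3

# Assumption: one optimizer step per batch, with the last partial batch kept.
steps_per_epoch = math.ceil(num_messages / batch_size)
total_steps = steps_per_epoch * epochs

print(num_messages, steps_per_epoch, total_steps)  # 45458 1421 4263
```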
# Testing
## Data
The datasets used to test this model were [knowhate/youtube-test](https://huggingface.co/datasets/knowhate/youtube-test) and [knowhate/twitter-test](https://huggingface.co/datasets/knowhate/twitter-test).
## Results
| Dataset | Precision | Recall | F1-score |
|:------------------------------|:-----------|:----------|:-------------|
| **knowhate/youtube-test** | 0.867 | 0.892 | **0.874** |
| **knowhate/twitter-test** | 0.397 | 0.627 | **0.486** |
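
For reference, the F1-score of a single class is the harmonic mean of its precision and recall. The sketch below recomputes it from the rounded values in the table; `f1_from` is a hypothetical helper. The card does not say how the metrics were averaged, so the recomputed values need not match the reported ones: the Twitter figure is reproduced, while the YouTube figure comes out slightly above the reported 0.874, which may reflect the averaging scheme used (for example, averaging across classes).

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall values from the table above.
reported = {
    "knowhate/youtube-test": (0.867, 0.892),
    "knowhate/twitter-test": (0.397, 0.627),
}

for name, (p, r) in reported.items():
    print(f"{name}: {f1_from(p, r):.3f}")
# knowhate/youtube-test: 0.879
# knowhate/twitter-test: 0.486
```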
# BibTeX Citation
Currently in Peer Review
``` latex
@article{
}
```
# Acknowledgements
This work was funded in part by the European Union under Grant CERV-2021-EQUAL (101049306).
However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or Knowhate Project.
Neither the European Union nor the Knowhate Project can be held responsible for them.