|
--- |
|
license: mit |
|
datasets: |
|
- Silly-Machine/TuPyE-Dataset |
|
language: |
|
- pt |
|
|
|
pipeline_tag: text-classification |
|
base_model: neuralmind/bert-large-portuguese-cased |
|
widget: |
|
- text: 'Bom dia, flor do dia!!' |
|
|
|
model-index:
- name: TuPy-Bert-Large-Multilabel
  results:
  - task:
      type: text-classification
    dataset:
      name: TuPyE-Dataset
      type: Silly-Machine/TuPyE-Dataset
    metrics:
    - type: f1
      value: 0.85
      name: F1-score
      verified: true
    - type: precision
      value: 0.85
      name: Precision
      verified: true
    - type: recall
      value: 0.85
      name: Recall
      verified: true
|
--- |
|
|
|
## Introduction |
|
|
|
|
|
TuPy-Bert-Large-Multilabel is a fine-tuned BERT model designed specifically for multilabel classification of hate speech in Portuguese. |
|
Derived from the [BERTimbau large](https://huggingface.co/neuralmind/bert-large-portuguese-cased), |
|
TuPy-Bert-Large-Multilabel offers a refined solution for categorizing hate speech into ten classes: ageism, aporophobia, body shame, ableism (capacitism), LGBTphobia, political, racism, religious intolerance, misogyny, and xenophobia.
|
For more details or specific inquiries, please refer to the [BERTimbau repository](https://github.com/neuralmind-ai/portuguese-bert/). |
|
|
|
The efficacy of language models can vary notably when there is a domain shift between training and test data.
To build a specialized Portuguese language model tailored for hate speech classification,
the original BERTimbau model underwent a fine-tuning process on
the [TuPy Hate Speech DataSet](https://huggingface.co/datasets/Silly-Machine/TuPyE-Dataset), sourced from diverse social networks.
|
|
|
## Available models |
|
|
|
| Model | Arch. | #Layers | #Params | |
|
| ---------------------------------------- | ---------- | ------- | ------- | |
|
| `Silly-Machine/TuPy-Bert-Base-Binary-Classifier` | BERT-Base | 12 | 109M |
|
| `Silly-Machine/TuPy-Bert-Large-Binary-Classifier` | BERT-Large | 24 | 334M | |
|
| `Silly-Machine/TuPy-Bert-Base-Multilabel` | BERT-Base | 12 | 109M | |
|
| `Silly-Machine/TuPy-Bert-Large-Multilabel` | BERT-Large | 24 | 334M | |
|
|
|
## Example usage |
|
|
|
```python |
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig |
|
import torch |
|
import numpy as np |
|
from scipy.special import softmax |
|
|
|
def classify_hate_speech(model_name, text): |
|
model = AutoModelForSequenceClassification.from_pretrained(model_name) |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
config = AutoConfig.from_pretrained(model_name) |
|
|
|
# Tokenize input text and prepare model input |
|
    model_input = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
|
|
|
# Get model output scores |
|
with torch.no_grad(): |
|
output = model(**model_input) |
|
scores = softmax(output.logits.numpy(), axis=1) |
|
ranking = np.argsort(scores[0])[::-1] |
|
|
|
# Print the results |
|
for i, rank in enumerate(ranking): |
|
label = config.id2label[rank] |
|
score = scores[0, rank] |
|
print(f"{i + 1}) Label: {label} Score: {score:.4f}") |
|
|
|
# Example usage |
|
model_name = "Silly-Machine/TuPy-Bert-Large-Multilabel" |
|
text = "Bom dia, flor do dia!!" |
|
classify_hate_speech(model_name, text) |
|
``` |
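
The softmax-and-ranking step used above can be illustrated in isolation, without downloading the model weights. This is a minimal sketch with made-up logit values (not real model output), mirroring what `scipy.special.softmax` and `np.argsort` do in the example:

```python
import numpy as np

def rank_labels(logits):
    """Convert a row of logits to softmax scores and return label indices
    ranked from highest to lowest score, as in the example above."""
    exps = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    scores = exps / exps.sum()              # normalize to a probability distribution
    ranking = np.argsort(scores)[::-1]      # indices sorted by descending score
    return scores, ranking

# Hypothetical logits for a 3-label toy example
scores, ranking = rank_labels(np.array([2.0, 0.5, -1.0]))
print(ranking.tolist())               # → [0, 1, 2]
print(round(float(scores.sum()), 6))  # → 1.0  (softmax scores sum to 1)
```

Note that softmax produces one distribution over all labels; depending on how a multilabel head was trained, a per-label sigmoid may be the appropriate activation instead.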