metadata

license: mit
datasets:
  - Silly-Machine/TuPyE-Dataset
language:
  - pt
pipeline_tag: text-classification
base_model: neuralmind/bert-base-portuguese-cased
widget:
  - text: Bom dia, flor do dia!!
model-index:
  - name: TuPy-Bert-Base-Binary-Classifier
    results:
      - task:
          type: text-classification
        dataset:
          name: TuPyE-Dataset
          type: Silly-Machine/TuPyE-Dataset
        metrics:
          - type: accuracy
            value: 0.901
            name: Accuracy
            verified: true
          - type: f1
            value: 0.899
            name: F1-score
            verified: true
          - type: precision
            value: 0.897
            name: Precision
            verified: true
          - type: recall
            value: 0.901
            name: Recall
            verified: true

Introduction

TuPy-Bert-Base-Binary-Classifier is a fine-tuned BERT model for binary classification of hate speech in Portuguese: each input text is labeled as hate or not hate. It is derived from the BERTimbau base model (`neuralmind/bert-base-portuguese-cased`). For more details on the base model, please refer to the BERTimbau repository.

The performance of language models can vary notably when the domain shifts between training and test data. To build a Portuguese language model specialized for hate speech classification, the original BERTimbau model was fine-tuned on the TuPy Hate Speech DataSet, which is sourced from diverse social networks.

Available models

| Model | Architecture | #Layers | #Params |
|---|---|---|---|
| `Silly-Machine/TuPy-Bert-Base-Binary-Classifier` | BERT-Base | 12 | 109M |
| `Silly-Machine/TuPy-Bert-Large-Binary-Classifier` | BERT-Large | 24 | 334M |
| `Silly-Machine/TuPy-Bert-Base-Multilabel` | BERT-Base | 12 | 109M |
| `Silly-Machine/TuPy-Bert-Large-Multilabel` | BERT-Large | 24 | 334M |
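
The parameter counts in the table can be roughly reproduced from the layer counts with a little arithmetic. The sketch below assumes the standard BERT layout (hidden sizes 768 and 1024, feed-forward sizes 3072 and 4096, 512 positions) and a BERTimbau vocabulary of 29,794 tokens; none of these values are stated in this README, and the helper `bert_param_count` is a hypothetical name used only for illustration. The small classification head is not included.

```python
def bert_param_count(vocab_size, hidden, layers, intermediate,
                     max_positions=512, type_vocab=2):
    """Count parameters of a standard BERT encoder (embeddings + layers + pooler)."""
    layer_norm = 2 * hidden  # scale and bias vectors
    # Token, position and segment embeddings, followed by one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases), plus LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, plus LayerNorm
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


VOCAB_SIZE = 29_794  # assumption: BERTimbau vocabulary size, not stated in this README

base = bert_param_count(VOCAB_SIZE, hidden=768, layers=12, intermediate=3072)
large = bert_param_count(VOCAB_SIZE, hidden=1024, layers=24, intermediate=4096)
print(f"BERT-Base:  {base / 1e6:.0f}M")   # prints "BERT-Base:  109M"
print(f"BERT-Large: {large / 1e6:.0f}M")  # prints "BERT-Large: 334M"
```

Under these assumptions, the estimates agree with the 109M and 334M figures listed for the Base and Large variants.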

Example usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
import torch
import numpy as np
from scipy.special import softmax

def classify_hate_speech(model_name, text):
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)

    # Tokenize input text and prepare model input
    # (truncation keeps long texts within the model's maximum sequence length)
    model_input = tokenizer(text, padding=True, truncation=True, return_tensors="pt")

    # Get model output scores
    with torch.no_grad():
        output = model(**model_input)
        scores = softmax(output.logits.numpy(), axis=1)
        ranking = np.argsort(scores[0])[::-1]

    # Print the results
    for i, rank in enumerate(ranking):
        label = config.id2label[rank]
        score = scores[0, rank]
        print(f"{i + 1}) Label: {label} Score: {score:.4f}")

# Example usage
model_name = "Silly-Machine/TuPy-Bert-Base-Binary-Classifier"
text = "Bom dia, flor do dia!!"
classify_hate_speech(model_name, text)
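
The post-processing step in the example (softmax over the logits, then sorting labels by score) can be tried in isolation without downloading the model. The following is a minimal sketch in pure Python; the logits and the label names are hypothetical placeholders, and the actual `id2label` mapping should be read from the model configuration.

```python
import math


def rank_labels(logits, id2label):
    """Turn raw logits into probabilities and return (label, score) pairs, best first."""
    # Subtract the largest logit before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    # Sort class indices by descending probability
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(id2label[i], scores[i]) for i in order]


# Hypothetical logits and label names, for illustration only
example_logits = [2.0, -1.0]
example_labels = {0: "not hate", 1: "hate"}

for position, (label, score) in enumerate(rank_labels(example_logits, example_labels), start=1):
    print(f"{position}) Label: {label} Score: {score:.4f}")
```

The scores always sum to 1, so for a binary classifier the top score alone is enough to apply a decision threshold.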