---
license: mit
datasets:
- Silly-Machine/TuPy-Dataset
language:
- pt
pipeline_tag: text-classification
base_model: neuralmind/bert-base-portuguese-cased
widget:
- text: 'Bom dia, flor do dia!!'
model-index:
- name: Yi-34B
  results:
  - task:
      type: text-generation
    dataset:
      name: ai2_arc
      type: ai2_arc
    metrics:
    - name: AI2 Reasoning Challenge (25-Shot)
      type: AI2 Reasoning Challenge (25-Shot)
      value: 64.59
    source:
      name: Open LLM Leaderboard
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
---

## Introduction

Tupi-BERT-Base is a fine-tuned BERT model designed specifically for binary classification of hate speech in Portuguese. Derived from [BERTimbau base](https://huggingface.co/neuralmind/bert-base-portuguese-cased), TuPi-Base is a refined solution for addressing hate speech concerns. For more details or specific inquiries, please refer to the [BERTimbau repository](https://github.com/neuralmind-ai/portuguese-bert/).

The performance of language models can vary notably when the training and test data come from different domains. To create a Portuguese language model specialized for hate speech classification, the original BERTimbau model was fine-tuned on the [TuPi Hate Speech DataSet](https://huggingface.co/datasets/FpOliveira/TuPi-Portuguese-Hate-Speech-Dataset-Binary), sourced from diverse social networks.

## Available models

| Model | Arch. | #Layers | #Params |
| ---------------------------------------- | ---------- | ------- | ------- |
| `FpOliveira/tupi-bert-base-portuguese-cased` | BERT-Base | 12 | 109M |
| `FpOliveira/tupi-bert-large-portuguese-cased` | BERT-Large | 24 | 334M |
| `FpOliveira/tupi-bert-base-portuguese-cased-multiclass-multilabel` | BERT-Base | 12 | 109M |
| `FpOliveira/tupi-bert-large-portuguese-cased-multiclass-multilabel` | BERT-Large | 24 | 334M |

## Example usage

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
import torch
import numpy as np
from scipy.special import softmax


def classify_hate_speech(model_name, text):
    # Load the fine-tuned classifier, its tokenizer, and its label mapping
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)

    # Tokenize input text and prepare model input
    model_input = tokenizer(text, padding=True, return_tensors="pt")

    # Get model output scores (no gradients are needed for inference)
    with torch.no_grad():
        output = model(**model_input)
        scores = softmax(output.logits.numpy(), axis=1)
        # Label indices sorted from highest to lowest score
        ranking = np.argsort(scores[0])[::-1]

    # Print the results
    for i, rank in enumerate(ranking):
        # Cast the NumPy index to a plain int before the dictionary lookup
        label = config.id2label[int(rank)]
        score = scores[0, rank]
        print(f"{i + 1}) Label: {label} Score: {score:.4f}")


# Example usage
model_name = "Silly-Machine/TuPy-Bert-Large-Multilabel"
text = "Bom dia, flor do dia!!"
classify_hate_speech(model_name, text)
```
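
The scoring and ranking step in the example above can be tried without downloading a model. The following is a minimal sketch, not the authors' implementation: the function `rank_labels`, the logits, and the label names `not_hate` and `hate` are all hypothetical and chosen only for illustration. The real label names come from `config.id2label` of the model you load.

```python
import numpy as np


def rank_labels(logits, id2label):
    """Return (label, probability) pairs sorted from most to least likely."""
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable softmax: subtract the max before exponentiating
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # Indices of the classes, highest probability first
    order = np.argsort(probs)[::-1]
    return [(id2label[int(i)], float(probs[i])) for i in order]


# Hypothetical values for illustration only; not produced by the real model
example_logits = [2.0, -1.0]
example_id2label = {0: "not_hate", 1: "hate"}

ranked = rank_labels(example_logits, example_id2label)
for position, (label, prob) in enumerate(ranked, start=1):
    print(f"{position}) Label: {label} Score: {prob:.4f}")
```

For a binary classifier the two probabilities sum to 1, so the top-ranked label is simply the one whose logit is larger.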