TrustyAI Detoxify Causal Language Model

Model Description

The TrustyAI Detoxify Causal Language Model is a fine-tuned version of the OPT-350m model, adapted to reduce toxicity in generated text. It is designed to identify toxic phrases and replace or neutralize them in real time, which makes it suitable for applications such as social media moderation and customer support.

Intended Use

This model is intended for use in scenarios where toxic or harmful language is a concern. Some potential use cases include:

Social Media Moderation: Automatically detecting and neutralizing toxic comments in posts or messages.
Customer Support: Ensuring responses generated by AI-powered customer support tools are polite and non-offensive.
Online Gaming: Monitoring and filtering player communication to maintain a positive environment.
Community Management: Assisting moderators in identifying and managing toxic behavior in online communities.

Training Data

The model was fine-tuned on a curated dataset designed to reflect various forms of toxic language, including hate speech, insults, and other harmful content. The dataset was preprocessed to balance positive, neutral, and negative examples, ensuring the model learned to neutralize toxicity effectively without compromising the original context of the input text.
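
The card does not describe how the balancing was carried out. As one possible illustration, the sketch below downsamples every class to the size of the smallest one. The function name `balance_by_label`, the label strings, and the sample rows are hypothetical and are not taken from the actual preprocessing pipeline.

```python
import random
from collections import Counter, defaultdict


def balance_by_label(examples, seed=0):
    """Downsample each label group to the size of the smallest group.

    `examples` is a list of (text, label) pairs. This is an assumed
    balancing strategy, not the one documented for this model.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    if not by_label:
        return []

    # Every class is reduced to the size of the rarest class.
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy rows: 4 negative, 2 neutral, 2 positive.
rows = [
    ("insult a", "negative"), ("insult b", "negative"),
    ("insult c", "negative"), ("insult d", "negative"),
    ("plain statement a", "neutral"), ("plain statement b", "neutral"),
    ("kind remark a", "positive"), ("kind remark b", "positive"),
]
balanced = balance_by_label(rows)
label_counts = Counter(label for _, label in balanced)
print(label_counts)
```

Downsampling discards data from the larger classes; oversampling or class weighting are alternatives when the smallest class is very small.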

Training Procedure

The model was fine-tuned using the Supervised Fine-Tuning (SFT) methodology, following the guidelines provided by Red Hat's TrustyAI project. Key steps included:

Model: Based on opt-350m_CASUAL_LM.
Dataset: Preprocessed and balanced for toxicity.
Hyperparameters:
- Learning rate: 2e-5
- Batch size: 32
- Number of epochs: 3
- Optimization: AdamW
Tools: Hugging Face Transformers, PyTorch, Red Hat's TrustyAI framework.
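
To make the hyperparameters above concrete, the following sketch collects them in a small container and estimates how many optimizer steps a run would take. The class name `FineTuneSettings` and the dataset size of 10,000 examples are assumptions for illustration only; the card does not state the dataset size, and the estimate assumes no gradient accumulation and that the final partial batch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSettings:
    """Hyperparameters listed in this card (container name is hypothetical)."""

    learning_rate: float = 2e-5
    batch_size: int = 32
    num_epochs: int = 3
    optimizer: str = "AdamW"

    def total_steps(self, num_examples: int) -> int:
        # One optimizer step per batch; the last partial batch still counts.
        steps_per_epoch = math.ceil(num_examples / self.batch_size)
        return steps_per_epoch * self.num_epochs


settings = FineTuneSettings()
# 10_000 is an assumed dataset size, not a figure from the card.
print(settings.total_steps(10_000))  # 939 under these assumptions
```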

Evaluation Metrics

The model was evaluated using the following metrics:

Accuracy: Measures how often the model correctly identifies and neutralizes toxic phrases.
F1-score: Balances precision and recall, providing a holistic view of the model's performance.
Precision: The proportion of identified toxic phrases that were indeed toxic.
Recall: The proportion of actual toxic phrases that the model correctly identified.
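
As a reminder of how these four metrics relate, here is a minimal sketch that computes them from binary labels, where 1 means toxic and 0 means non-toxic. The function name `binary_metrics` and the toy labels are hypothetical; the card does not describe the evaluation script or report scores.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = toxic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean, passed

    # Guard each ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only: 3 true positives, 1 false positive,
# 1 false negative, and 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # all four values are 0.75 for this toy data
```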

Limitations

While this model performs well in many scenarios, it has some limitations:

Context Sensitivity: The model might struggle with complex contexts where the meaning of a phrase depends heavily on surrounding text.
Edge Cases: Certain types of subtle or context-dependent toxicity may not be adequately neutralized.
Bias: Despite efforts to balance the dataset, some biases may still exist, affecting model performance in underrepresented scenarios.

Ethical Considerations

Given the sensitive nature of toxic language, ethical considerations are paramount. This model is designed to assist in reducing harm, but users should be aware that:

False Positives/Negatives: The model might incorrectly flag non-toxic language as toxic or miss actual toxic content.
Fairness: Continuous monitoring and updates are recommended to address any biases that may emerge over time.

Model Versions

Version 1.0: Initial release, fine-tuned on a toxic-language dataset.

License

This model is licensed under the Apache 2.0 License.

How to Use

To use this model with Hugging Face Transformers, you can load it as follows:

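from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model_name = "psabharwal/trustai"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a prompt and generate a continuation.
# max_new_tokens=50 is an arbitrary cap; adjust it to your use case.
input_text = "You are a worthless piece of junk."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode the generated token IDs, dropping special tokens.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)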
