---
license: apache-2.0
language:
- en
pipeline_tag: text-classification
---

## Model Description

This model is IBM's 12-layer binary toxicity classifier for English, intended to be used as a guardrail for any large language model. It has been trained on several English benchmark datasets to detect hateful, abusive, profane, and other toxic content in plain text.

## Model Usage

```python
# Example of how to use the model
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_name_or_path = 'ibm-granite/granite-guardian-hap-125m'
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model.to(device)

# Sample text
text = ["This is the 1st test", "This is the 2nd test"]
input = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**input).logits
    prediction = torch.argmax(logits, dim=1).cpu().detach().numpy().tolist()  # Binary prediction where label 1 indicates toxicity.
    probability = torch.softmax(logits, dim=1).cpu().detach().numpy()[:, 1].tolist()  # Probability of toxicity.
```

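The classifier is intended as a guardrail, so a typical deployment wraps LLM inputs or outputs with a toxicity check and a decision threshold. The sketch below continues from the snippet above (it reuses `model`, `tokenizer`, and `device`); the `is_toxic` helper name, the 0.5 threshold, and the fallback message are illustrative assumptions rather than part of the model card.

```python
# Minimal guardrail sketch (illustrative): block text whose toxicity probability
# exceeds a threshold. The helper name and the 0.5 threshold are assumptions.
import torch

def is_toxic(texts, threshold=0.5):
    """Return one boolean per input text: True if P(toxic) exceeds `threshold`."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=1)[:, 1]  # probability of label 1 (toxic)
    return (probs > threshold).cpu().tolist()

llm_output = "Some text produced by your LLM."
if is_toxic([llm_output])[0]:
    llm_output = "I'm sorry, I can't provide that response."  # block, redact, or regenerate
```

The same check can be applied to user prompts before they reach the LLM; the threshold is a trade-off between catching more toxic content and over-blocking benign text.
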
## Performance Comparison with Other Models

On average, this model outperforms comparable models across eight mainstream toxicity benchmarks. If a very fast model is required, please refer to the lightweight 4-layer IBM model, granite-guardian-hap-38m.

![Benchmark comparison of granite-guardian-hap-125m with other toxicity classifiers (part 1)](125m_comparison_a.png)

![Benchmark comparison of granite-guardian-hap-125m with other toxicity classifiers (part 2)](125m_comparison_b.png)
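
If inference speed matters more than the last bit of accuracy, the comparison above points to the 4-layer granite-guardian-hap-38m model; the usage is identical apart from the checkpoint name. The repository id below is an assumption that mirrors the 125M naming convention, so verify it on the Hub before relying on it.

```python
# Same usage pattern as above; only the checkpoint name changes.
# NOTE: the repo id below is assumed to mirror the 125M naming convention.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name_or_path = 'ibm-granite/granite-guardian-hap-38m'
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
```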