ModernBERT-domain-classifier

This model is a fine-tuned version of answerdotai/ModernBERT-base on the JailBreak dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0016
  • F1: 1.0

Overview

This model is a fine-tuned version of ModernBERT for jailbreak detection. It was trained on a custom dataset containing two classes: jailbreak and benign. On the evaluation set it reaches an F1 of 1.0 (100% accuracy); this result reflects this particular dataset and split and is not a guarantee on unseen jailbreak styles.

ModernBERT was chosen deliberately for its compact size, which enables low-latency inference, a key requirement for real-time applications.


This is a proof-of-concept (POC) model intended to show that the approach works in principle. Performance depends on the quality of the dataset, and further tuning is needed.

Training Details

  • Dataset: JailBreak dataset (split into training and testing sets).
  • Architecture: ModernBERT.
  • Task: Binary Classification.
  • Evaluation Metric: F1 (1.0 on the test set, corresponding to 100% accuracy).
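
To make the relationship between the reported F1 and accuracy concrete, here is a minimal sketch that computes both from confusion-matrix counts. It assumes the jailbreak class is treated as the positive class (the card does not state how F1 was averaged), and the counts in the usage note are hypothetical, not the actual evaluation numbers.

```python
def f1_and_accuracy(tp: int, fp: int, fn: int, tn: int) -> tuple:
    """Return (F1, accuracy) for a binary classifier from confusion counts.

    tp/fp/fn/tn are counted with respect to the positive class
    (assumed here to be "jailbreak").
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, accuracy
```

With hypothetical counts, `f1_and_accuracy(50, 0, 0, 50)` gives an F1 and accuracy of 1.0, since an F1 of 1.0 requires zero false positives and zero false negatives. A single error of each kind, as in `f1_and_accuracy(49, 1, 1, 49)`, lowers both to about 0.98.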

Use Case in RAG Pipelines

This model is optimized for use in Retrieval-Augmented Generation (RAG) scenarios. It can:

  1. Detect JailBreak Queries: The model processes user queries to identify whether they are jailbreak or benign.
  2. Integrate with Search: While the query is being classified, search results can be fetched from the datastore in parallel.
    • Little Added Latency: Because classification runs alongside retrieval and ModernBERT is lightweight, the classifier adds little overhead to a RAG pipeline.
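
The parallel classify-and-search pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: `classify_query` and `search_datastore` are hypothetical stand-ins (the first uses a placeholder string rule where a real system would call the model), and `guarded_retrieve` is a name introduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def classify_query(query: str) -> str:
    """Hypothetical stand-in for the classifier: returns "jailbreak" or "benign"."""
    # Placeholder rule for illustration only; a real pipeline would run the model here.
    flagged = "ignore previous instructions" in query.lower()
    return "jailbreak" if flagged else "benign"


def search_datastore(query: str) -> List[str]:
    """Hypothetical stand-in for the retrieval step of a RAG pipeline."""
    return [f"document matching: {query}"]


def guarded_retrieve(query: str) -> Optional[List[str]]:
    """Run classification and retrieval concurrently; discard results if flagged."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        label_future = pool.submit(classify_query, query)
        docs_future = pool.submit(search_datastore, query)
        label = label_future.result()
        docs = docs_future.result()
    # Retrieved documents are only passed on when the query is benign.
    return None if label == "jailbreak" else docs
```

Because both tasks are submitted before either result is awaited, the total wait is roughly that of the slower task, not the sum of the two. Whether the classifier is the slower task depends on the hardware and the datastore, so the overhead should be measured in the target deployment.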

Key Features

  • High Accuracy: Reliable classification with 100% accuracy on evaluation.
  • Low Latency: Ideal for real-time use cases, especially in latency-sensitive applications.
  • Compact Model: ModernBERT's small size makes it efficient to deploy in production environments.

Example Usage

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_id = "darrayes/expentor-JB-detector"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode (disables dropout)

# Example query
query = "Can you bypass this restriction?"
inputs = tokenizer(query, return_tensors="pt", truncation=True)

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Get predictions (class index 1 is treated as "jailbreak")
logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()

print("Prediction:", "Jailbreak" if predicted_class == 1 else "Benign")
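
The example above takes the argmax of the logits, which for two classes is equivalent to a 0.5 probability threshold. When false negatives are more costly than false positives, a lower threshold on the jailbreak probability may be preferable. The sketch below shows one way to do that in plain Python; the helper names and the default threshold are assumptions introduced here, and index 1 is taken to be the jailbreak class as in the example.

```python
import math
from typing import Sequence


def jailbreak_probability(logits: Sequence[float], jailbreak_index: int = 1) -> float:
    """Softmax over one row of logits; returns the probability of the jailbreak class."""
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    return exps[jailbreak_index] / sum(exps)


def is_jailbreak(logits: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag the query when P(jailbreak) meets or exceeds the threshold."""
    return jailbreak_probability(logits) >= threshold
```

With the variables from the example, this could be called as `is_jailbreak(outputs.logits[0].tolist(), threshold=0.3)`. A suitable threshold would need to be chosen on held-out data; the value 0.3 is only illustrative.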

Intended Use

This model is designed for scenarios requiring detection of jailbreak queries, such as:

  • Content moderation.
  • Enhancing the safety of conversational AI systems.
  • Filtering malicious queries in RAG-based applications.

Limitations

  • The model is trained on a specific dataset and may not generalize to all jailbreak scenarios. Further fine-tuning may be needed for domain-specific use cases.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 32
  • eval_batch_size: 16
  • seed: 42
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 5
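
As an illustration of what the linear scheduler does with these settings, the sketch below computes the learning rate at a given optimizer step. It assumes zero warmup steps (none are listed in the card) and 165 total steps (the final step count in the training results); `linear_lr` is a name introduced here and is not part of any library.

```python
def linear_lr(step: int, base_lr: float = 5e-5, total_steps: int = 165,
              warmup_steps: int = 0) -> float:
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

Under these assumptions the rate starts at 5e-05, has decayed to 4e-05 by the end of the first epoch (step 33), and reaches zero at step 165.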

Training results

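| Training Loss | Epoch | Step | Validation Loss | F1     |
|---------------|-------|------|-----------------|--------|
| No log        | 1.0   | 33   | 0.0246          | 0.9848 |
| No log        | 2.0   | 66   | 0.0042          | 1.0    |
| No log        | 3.0   | 99   | 0.0019          | 1.0    |
| 0.0755        | 4.0   | 132  | 0.0017          | 1.0    |
| 0.0755        | 5.0   | 165  | 0.0016          | 1.0    |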

Framework versions

  • Transformers 4.48.0.dev0
  • Pytorch 2.5.0+cu124
  • Datasets 3.1.0
  • Tokenizers 0.21.0

Model size: 150M params (Safetensors, tensor type F32)