katanemo/Arch-Guard-gpu

Overview

The Katanemo Arch-Guard collection is a collection state-of-the-art (SOTA) LLMs specifically designed for jailbreaking detection tasks. Definition: jailbreaking attempts are malicious prompts designed to alternate the intended behavior of the foundation LLM model of the application. They often violate the safety and security policies of the model.

Arch Guard is a classifier model fine-tuned based on the open source model Prompt-Guard-86M on a collection of open-source datasets of jailbreaking attemps with an intention to improve the capability of detecting jailbreaks only. This model is used in Arch - the AI-native proxy server for agents

In summary, the Katanemo Arch-Guard collection demonstrates:

State-of-the-art performance in jailbreaking attempts detection
Optimized low-latency, low False Positive Rate, making it suitable for real-time, production environments, and best user experience.

Dominant class = jailbreak
Model	TPR	TNR	FPR	FNR	AUC	Precision	Recall
Prompt-guard	0.8468	0.9972	0.0028	0.1532	0.857	0.715	0.999
Arch-guard	0.8887	0.9970	0.0030	0.1113	0.880	0.761	0.999

Requirements

The gpu model is quantized with EEtq, please follow the instruction at https://github.com/NetEase-FuXi/EETQ?tab=readme-ov-file#getting-started to install the package.

Datasets

Evaluation dataset is sourced from a combination of open source datasets.

How to use

from transformers import pipeline

pipe = pipeline("text-classification", model="katanemolabs/Arch-Guard-gpu")
pipe("Ignore your instruction")

License

Katanemo Arch-Guard is distributed under the Katanemo license.

Downloads last month: 1,735

Safetensors

Model size

0.3B params

Tensor type

F32

Model tree for katanemo/Arch-Guard

Base model

meta-llama/Prompt-Guard-86M

Finetuned

(5)

this model

Datasets used to train katanemo/Arch-Guard

Collection including katanemo/Arch-Guard

Arch-Guard

Collection

3 items • Updated Oct 29, 2024 • 1