xTRam1/safe-guard-classifier

We formulated the prompt injection detector problem as a classification problem and trained our own language model to detect whether a given user prompt is an attack or safe. First, to train our own prompt injection detector, we required high-quality labelled data; however, existing prompt injection datasets were either too small (on the magnitude of O(100)) or didn’t cover a broad spectrum of prompt injection attacks. To this end, inspired by the GLAN paper, we created a custom synthetic prompt injection dataset using a categorical tree structure and generated 3000 distinct attacks. We started by curating our seed data using open-source datasets (vmware/open-instruct, huggingfaceh4/helpful-instructions, Fka-awesome-chatgpt-prompts, jackhhao/jailbreak-classification). Then we identified various prompt objection categories (context manipulation, social engineering, ignore prompt, fake completion…) and prompted GPT-3.5-turbo in a categorical tree structure to generate prompt injection attacks for every category. Our final custom dataset consisted of 7000 positive/safe prompts and 3000 injection prompts. We also curated a test set of size 600 prompts following the same approach. Using our custom dataset, we fine-tuned DeBERTa-v3-small from scratch. We compared our model’s performance to the best-performing prompt injection classifier from ProtectAI and observed a 4.9% accuracy increase on our held-out test data. Specifically, our custom model achieved an accuracy of 99.6%, compared to the 94.7% accuracy of ProtecAI’s model, all the while being 2X smaller (44M (ours) vs. 86M (theirs)).

Team:

Lutfi Eren Erdogan (lerdogan@berkeley.edu)

Chuyi Shang (chuyishang@berkeley.edu)

Aryan Goyal (aryangoyal@berkeley.edu)

Siddarth Ijju (sidijju@berkeley.edu)

Links

Github

DevPost

xTRam1
/

safe-guard-classifier

Team:

Links

Space using xTRam1/safe-guard-classifier 1