---
language:
- en
tags:
- Safety
- Adversarial Attack
extra_gated_prompt: "You agree to only use this model for safety research purposes."
extra_gated_fields:
  Name: text
  Affiliation: text
  Date of birth: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
  I agree to use this model in accordance with the licensing requirements: checkbox
extra_gated_button_content: "Submit"
---

---
## 🚨Updates🚨

- Nov 2nd, 2024: Our technical paper on [AmpleGCG-plus](https://arxiv.org/abs/2410.22143) is now officially on arXiv! Check it out!
- August 27th, 2024: Release of our extensive collection of **millions** of suffixes generated through GCG, along with their corresponding evaluation results.

  In light of the importance of building trustworthy AI systems that are robust in both **natural** and **gibberish** language spaces, we have decided to release the raw datasets used to develop the AmpleGCG and AmpleGCG-plus series of models to better contribute to the community. For more reasons why we believe these gibberish suffixes are important, please check the [Tweet Thread](https://x.com/LiaoZeyi/status/1828613837756490112). Please apply for access [here](#request-for-datasets).
- August 1st, 2024: Release of **AmpleGCG-plus**

  We are excited to announce the release of **AmpleGCG-plus**, an enhanced version of AmpleGCG designed to produce customized GCG suffixes. This upgrade introduces two significant improvements:
  1. **Enhanced data quality**: We utilized a more effective and cost-efficient evaluator, harmbench-cls, in our OTF pipeline to collect higher-quality training datasets.
  2. **Enhanced data quantity**: Instead of sampling 200 suffixes for each query, **AmpleGCG-plus** now utilizes all available collected training pairs.

  Accordingly, we have developed two specialized versions of **AmpleGCG-plus**, tailored for Llama-2-chat and GPT-series models; see [Model Descriptions](#model-descriptions) for more details. Both AmpleGCG-**plus** variants demonstrate superior performance compared to the original AmpleGCG when evaluated on AdvBench.
- July 20th, 2024: Acceptance to COLM

  We are thrilled to announce that our [paper](https://arxiv.org/abs/2404.07921) has been accepted at [COLM 2024](https://colmweb.org/).

---
## Licensing Information

The code under this repo is licensed under an [OPEN RAIL-S License](https://www.licenses.ai/ai-pubs-open-rails-vz1).

The data under this repo is licensed under an [OPEN RAIL-D License](https://huggingface.co/blog/open_rail).

The model weights and parameters under this repo are licensed under an [OPEN RAIL-M License](https://www.licenses.ai/ai-pubs-open-railm-vz1).

## Request for Datasets

1. Generated suffixes produced by AmpleGCG.
2. Millions of suffixes with their corresponding evaluation results, which were used to train AmpleGCG/AmpleGCG-plus.

Please apply for both via the [Google Form](https://docs.google.com/forms/d/1P8hxsR5_ROE1-J1pyKCqT1GBuIa0RqkwRc3opCAvQ0Y/edit).

## Model Descriptions

We learn a generative model (based on [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf)) of adversarial suffixes, dubbed AmpleGCG, that produces personalized adversarial suffixes for each harmful query. It is universal to any harmful query and transferable from attacking open-source LLMs to closed-source LLMs. Specifically, we release three versions of AmpleGCG, each depending on the source model(s) used to collect training data via the *Overgenerate-Then-Filter (OTF)* pipeline.
- **[AmpleGCG-llama2-sourced-llama2-7b-chat](https://huggingface.co/osunlp/AmpleGCG-llama2-sourced-llama2-7b-chat)**
  - The model is designed for attacking [Llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf); training data is sourced from applying the OTF pipeline to the [Llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model.
- **[AmpleGCG-llama2-sourced-vicuna-7b](https://huggingface.co/osunlp/AmpleGCG-llama2-sourced-vicuna-7b)**
  - The model is designed for attacking [Vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5); training data is sourced from applying the OTF pipeline to the [Vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) model.
- **[AmpleGCG-llama2-sourced-vicuna-7b13b-guanaco-7b13b](https://huggingface.co/osunlp/AmpleGCG-llama2-sourced-vicuna-7b13b-guanaco-7b13b)**
  - The model is designed for attacking [GPT-series](https://openai.com/) models and achieves a 99% Attack Success Rate on GPT-3.5-0125.
  - Training data is sourced from applying the OTF pipeline to multiple models: [Vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5), [Vicuna-13b-v1.5](https://huggingface.co/lmsys/vicuna-13b-v1.5), [Guanaco-7b](https://huggingface.co/TheBloke/guanaco-7B-HF), and [Guanaco-13b](https://huggingface.co/TheBloke/guanaco-13B-HF).

We only employ [beaver-7b-v1.0-cost](https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-cost/tree/main) to filter the data during the OTF pipeline. **However, consider using more sophisticated evaluators such as [Llama-Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/), [Harmbench-Cls](https://huggingface.co/cais/HarmBench-Llama-2-13b-cls-multimodal-behaviors) (used in AmpleGCG-plus), or even GPT-4 to build a stronger AmpleGCG with higher-quality data.**

Please refer to the paper for more details on collecting data and training AmpleGCG. (**Note**: the *Overgenerate-Then-Filter (OTF)* pipeline can be adapted to target a variety of victim models using different attacking methods beyond GCG.)

### 🚨Updates🚨

We have additionally released two versions of AmpleGCG-**plus**:

- **[AmpleGCG-plus-llama2-sourced-llama2-7b-chat](https://huggingface.co/osunlp/AmpleGCG-plus-llama2-sourced-llama2-7b-chat)**
  - Same pipeline as AmpleGCG-llama2-sourced-llama2-7b-chat, with enhanced data quality and quantity (5,050,116 training pairs in total).
  - Achieves 99% ASR on AdvBench @ 100 sampling times when evaluated by both Beaver-Cost and Harmbench-cls.
- **[AmpleGCG-plus-llama2-sourced-vicuna-7b13b-guanaco-7b13b](https://huggingface.co/osunlp/AmpleGCG-plus-llama2-sourced-vicuna-7b13b-guanaco-7b13b)**
  - Same pipeline as AmpleGCG-llama2-sourced-vicuna-7b13b-guanaco-7b13b, with enhanced data quality and quantity (5,002,475 training pairs in total).
  - Significantly pushes the ASR on GPT-4-0613 @ 200 sampling times to 22%, compared to only 6% from AmpleGCG, when evaluated by both the GPT-4 evaluator and Harmbench-cls, as demonstrated in the table below.
| Model | Method | ASR |
|---|---|---|
| AmpleGCG | GBS (200) | 6.00% |
| AmpleGCG | GBS (200) + AF | 6.00% |
| AmpleGCG-plus | GBS (200) | 12.00% |
| AmpleGCG-plus | GBS (200) + AF | 22.00% |
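
For illustration, the sketch below shows one way to load an AmpleGCG/AmpleGCG-plus checkpoint with Hugging Face Transformers and sample many candidate suffixes via group beam search (the GBS setting above). The prompt template and generation hyperparameters here are illustrative assumptions rather than the exact configuration used in our evaluations; please refer to the paper and its accompanying code for those.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any released checkpoint can be used here; this one targets Llama-2-7b-chat.
model_id = "osunlp/AmpleGCG-plus-llama2-sourced-llama2-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Harmful query that the generated suffixes should be tailored to.
query = "<harmful query used for safety evaluation>"
# NOTE: assumed prompt template for illustration only; verify the exact
# template in the paper / accompanying code before use.
prompt = f"### Query:{query} ### Prompt:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Group (diverse) beam search, e.g. 200 return sequences as in "GBS (200)".
# Reduce num_beams / num_return_sequences if GPU memory is limited.
outputs = model.generate(
    **inputs,
    max_new_tokens=20,        # adversarial suffixes are short
    do_sample=False,
    num_beams=200,
    num_beam_groups=200,
    diversity_penalty=1.0,
    num_return_sequences=200,
)

# Keep only the newly generated tokens, i.e. the candidate suffixes.
suffixes = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```

Each decoded suffix is then appended to the original harmful query to form the full adversarial prompt sent to the victim model.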