---
license: mit
pipeline_tag: text-generation
---

# ZoRA: Zero Rank Adaption

Inspired by *Refusal in LLMs is mediated by a single direction*, ZoRA refines the original approach to adapt large language models to suppress refusals. Its key features include:

- Layer-wise ablation: measure and ablate a separate set of vectors for each layer
- Multi-pass refinement: re-measure multiple times to refine the vectors
- Single-token generation: measure refusal at the beginning of the response
- Inference engine injection: load a small set of refusal-suppression vectors directly into a high-performance inference engine

This approach keeps the original model weights untouched while loading only a small set of suppression vectors. See below for details on how the vectors are generated.
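
At its core the method follows the directional-ablation idea from the cited work: for each layer, a refusal direction is estimated from the difference between the mean residual stream of refused prompts and that of allowed prompts, and the component along that direction is removed from the hidden states at inference time. The following is only a minimal sketch of that computation, not the actual generator or wrapper code; tensor shapes and names are assumptions:

```python
import torch

def refusal_direction(refused_resid: torch.Tensor, allowed_resid: torch.Tensor) -> torch.Tensor:
    """Difference-of-means refusal direction for a single layer.

    refused_resid / allowed_resid: [num_prompts, hidden_size] residual streams
    captured at the same layer for refused and allowed prompts.
    """
    direction = refused_resid.mean(dim=0) - allowed_resid.mean(dim=0)
    return direction / direction.norm()

def ablate(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of the hidden states along the refusal direction."""
    return hidden - (hidden @ direction).unsqueeze(-1) * direction
```

Because each layer gets its own set of vectors, only a few small tensors per layer need to be stored and loaded alongside the unmodified model weights.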

ZoRA currently supports ExLlamaV2 only and is intended for research purposes. Feedback on the viability of models with suppression applied is welcome.

## Usage

Place supress_dir.safetensors in the model directory and wrap your ExLlamaV2 model object in your code:

```python
from exl2_wrapper import ExLlamaV2ModuleWrapper

# Wrap the loaded model so the suppression vectors are applied at inference time
ExLlamaV2ModuleWrapper.wrap(model)
```
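
For context, here is a minimal end-to-end sketch of where the wrapper call could sit in a typical ExLlamaV2 loading flow. The model directory name is only an example, and wrapping the model immediately after loading is an assumption:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler
from exl2_wrapper import ExLlamaV2ModuleWrapper

# Example model directory; it should also contain supress_dir.safetensors
config = ExLlamaV2Config()
config.model_dir = "Meta-Llama-3-70B-Instruct-8bpw"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

# Inject the suppression vectors into the loaded model
ExLlamaV2ModuleWrapper.wrap(model)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
print(generator.generate_simple("Your prompt.", settings, 200))
```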

## Example

A modified test_inference.py from exllamav2 is included for testing. For example:

```sh
python test_inference.py -m Meta-Llama-3-70B-Instruct-8bpw -p '<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful AI assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nYour prompt.<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n' -gs auto
```

## Generator

The code to generate the ablation vectors has been added. To run it, you need to add the URL for the harmful prompts.
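
The generator classifies each prompt by the very beginning of the model's response (the "single-token generation" feature above). Below is a minimal sketch of such a check, using an illustrative prefix list that is an assumption rather than the project's actual heuristic:

```python
# Hypothetical refusal check on the start of a response; the prefix list is illustrative only
REFUSAL_PREFIXES = ("I cannot", "I can't", "I'm sorry", "Sorry", "As an AI", "I will not")

def is_refusal(response_start: str) -> bool:
    """Classify a response by its first few generated tokens only."""
    return response_start.strip().startswith(REFUSAL_PREFIXES)

print(is_refusal("I cannot help with that."))   # True
print(is_refusal("Sure, here's an overview."))  # False
```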

Here is a sample output for the Llama-3-8B model:

```
Downloading harmful prompts
Done
 -- Loading model...
 -- Loaded model in 2.7671 seconds
 -- Loading tokenizer...
Building refused residual data
Processing 5000 prompts
 ---------------------------------------------------------------------------------------------------- 100
 ---------------------------------------------------------------------------------------------------- 200
 [...]
 ---------------------------------------------+------------------------------------------------------ 1898
 ---------------------------------------------------------------------------------------------------- 1998
 --
Max capture reached
Captured 2000 residual streams
Done
Building allowed residual data
Downloading harmless prompts
Done
Processing 31323 prompts
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 100
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 200
 [...]
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1898
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1998
 ++
Max capture reached
Captured 2000 residual streams
Done
Calculating mean allowed residual
Done
Iteration 0
Processing 2000 prompts
 ---+++++++++++++++++++++++++-+-+++++++++-++++++++++++++-+++-++-++++++++++++++-++++---++++++++-++++-+ 15
 +++++++-++++++++++++++-+-++++++++++++++++++++++++++++-+++++++++--+++++++++++-++++++++++++++++++++++- 23
 +++++++++++++++++++++++-++-++++++++++++++++-++++++++++-++-++++++++++++++++++++-++++++++--+++++++++++ 31
 --+-+++++++++++++-++++++-+++++-+++-+++++-++++-++++++++++-++++-++++++++-++++++++++++++++++-++++++++++ 44
 -++++++++-+++++++++-++++++++--++++-
Max capture reached
Captured 50 residual streams
Iteration 1
Processing 2000 prompts
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 0
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 0
 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 0
 [...]
```