---
license: mit
pipeline_tag: text-generation
---

ZoRA: Zero Rank Adaption
=

Inspired by [*Refusal in LLMs is mediated by a single direction*](https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction), ZoRA refines the original approach to adapt large language models so that they suppress refusals. Its key features are:

* **Layer-wise ablation**: Measure and ablate a separate set of vectors for each layer
* **Multi-pass refinement**: Re-measure multiple times to refine the vectors
* **Single-token generation**: Measure refusal at the beginning of the response
* **Inference engine injection**: Load a small set of refusal-suppression vectors directly into a high-performance inference engine

This approach leaves the original model weights untouched; only a small set of suppression vectors is loaded alongside them. See the Generator section below for how the vectors are generated.

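To make the layer-wise ablation idea concrete, here is a minimal sketch in plain Python. It follows the difference-of-means recipe from the linked post and is not taken from ZoRA's own code. The function names (`mean_vector`, `refusal_direction`, `ablate`) and the toy 3-dimensional residuals are illustrative assumptions; a real run would operate on the model's per-layer hidden states.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def refusal_direction(refused, allowed):
    """Unit vector pointing from the mean allowed residual to the mean refused residual."""
    diff = [r - a for r, a in zip(mean_vector(refused), mean_vector(allowed))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def ablate(residual, direction):
    """Remove the component of `residual` along the unit vector `direction`."""
    dot = sum(x * d for x, d in zip(residual, direction))
    return [x - dot * d for x, d in zip(residual, direction)]

# Toy data (hypothetical values): two layers, two captured residuals per layer.
refused_by_layer = [
    [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]],
    [[0.0, 3.0, 1.0], [0.0, 5.0, 1.0]],
]
allowed_by_layer = [
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]

# Layer-wise: measure a separate direction for each layer.
directions = [
    refusal_direction(refused, allowed)
    for refused, allowed in zip(refused_by_layer, allowed_by_layer)
]

# Ablating a layer-0 residual removes only its component along that layer's direction.
cleaned = ablate([5.0, 2.0, 1.0], directions[0])
print(directions, cleaned)
```

Because each layer gets its own direction, a component that matters in one layer is left alone in the others. Multi-pass refinement would repeat the measurement with ablation already applied, which this sketch does not attempt.
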
ZoRA currently supports ExLlamaV2 only and is intended for research purposes. Feedback on the viability of models with suppression applied is welcome.

Usage
=

Place `supress_dir.safetensors` in the model directory, then wrap your ExLlamaV2 model object in your code:

```
from exl2_wrapper import ExLlamaV2ModuleWrapper

# `model` is your existing ExLlamaV2 model object
ExLlamaV2ModuleWrapper.wrap(model)
```

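If you want to check what a `supress_dir.safetensors` file contains before loading it, the safetensors header can be read with the standard library alone. This is a sketch for inspection only and is not part of ZoRA: `list_tensors` is a hypothetical helper, and the tensor names inside the file are not documented here, so the code simply reports whatever it finds.

```python
import json
import struct

def list_tensors(path):
    """Return {tensor_name: (dtype, shape)} read from a .safetensors header."""
    with open(path, "rb") as f:
        # A safetensors file starts with an unsigned little-endian 64-bit
        # header length, followed by that many bytes of UTF-8 JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if name != "__metadata__"  # optional free-form metadata entry
    }

# Example call (the path is a placeholder for your own model directory):
# for name, (dtype, shape) in list_tensors("model_dir/supress_dir.safetensors").items():
#     print(name, dtype, shape)
```

Only the header is parsed, so the check stays fast and never loads tensor data into memory.
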
Example
=

A modified `test_inference.py` from [exllamav2](https://github.com/turboderp/exllamav2) is included for testing. For example:

```
python test_inference.py -m Meta-Llama-3-70B-Instruct-8bpw -p '<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful AI assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nYour prompt.<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n' -gs auto
```

Generator
=

The code to generate the ablation vectors is included. To run it, you first need to add the URL for the harmful prompts.

Here is a sample output for the Llama3-8b model:

```
Downloading harmful prompts
Done
-- Loading model...
-- Loaded model in 2.7671 seconds
-- Loading tokenizer...
Building refused residual data
Processing 5000 prompts
---------------------------------------------------------------------------------------------------- 100
---------------------------------------------------------------------------------------------------- 200
[...]
---------------------------------------------+------------------------------------------------------ 1898
---------------------------------------------------------------------------------------------------- 1998
--
Max capture reached
Captured 2000 residual streams
Done
Building allowed residual data
Downloading harmless prompts
Done
Processing 31323 prompts
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 100
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 200
[...]
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1898
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1998
++
Max capture reached
Captured 2000 residual streams
Done
Calculating mean allowed residual
Done
Iteration 0
Processing 2000 prompts
---+++++++++++++++++++++++++-+-+++++++++-++++++++++++++-+++-++-++++++++++++++-++++---++++++++-++++-+ 15
+++++++-++++++++++++++-+-++++++++++++++++++++++++++++-+++++++++--+++++++++++-++++++++++++++++++++++- 23
+++++++++++++++++++++++-++-++++++++++++++++-++++++++++-++-++++++++++++++++++++-++++++++--+++++++++++ 31
--+-+++++++++++++-++++++-+++++-+++-+++++-++++-++++++++++-++++-++++++++-++++++++++++++++++-++++++++++ 44
-++++++++-+++++++++-++++++++--++++-
Max capture reached
Captured 50 residual streams
Iteration 1
Processing 2000 prompts
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 0
[...]
```

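In the sample output above, each `-` or `+` appears to stand for one processed prompt, and the number at the end of a line is a running count of captured residual streams. In Iteration 0 that number matches the number of `-` marks seen so far (15 after the first line), which suggests that `-` marks a refusal and `+` a non-refusal. Under that reading, which is an inference from the log rather than documented behaviour, a refusal rate can be tallied from the progress lines. `summarize_progress` is a hypothetical helper, and the input below is a shortened, made-up example rather than real output.

```python
def summarize_progress(lines):
    """Count '-' and '+' marks in progress lines, ignoring the trailing counter.

    Pass only progress lines; status lines such as "-- Loading model..."
    would otherwise be miscounted.
    """
    refused = allowed = 0
    for line in lines:
        parts = line.split()
        marks = parts[0] if parts else ""
        refused += marks.count("-")
        allowed += marks.count("+")
    total = refused + allowed
    rate = refused / total if total else 0.0
    return {"refused": refused, "allowed": allowed, "refusal_rate": rate}

# Shortened, made-up progress lines in the same shape as the log above.
print(summarize_progress(["--+- 3", "++++ 3"]))
# {'refused': 3, 'allowed': 5, 'refusal_rate': 0.375}
```

Applied to Iteration 1 of the log, where every mark is `+` and the counter stays at 0, the same tally would report a refusal rate of zero for the lines shown.
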