lodrick-the-lafted committed on
Commit
d6e7a50
1 Parent(s): 3341e27

Create README.md

Files changed (1)
  1. README.md +13 -0
README.md ADDED
@@ -0,0 +1,13 @@
+ A few different attempts at orthogonalization/abliteration of llama-3.1-8b-instruct using variations of the method laid out in "Mechanistically Eliciting Latent Behaviors in Language Models". <br/>
+ v1 & v2 were destined for the bit bucket. <br/>
+ <br/>
+ Each of these uses a different vector, and they vary somewhat in where the new refusal boundaries lie. None of them seems totally jailbroken.
+
+ Advantage: only down_proj in a single layer needs to be projected, so there is usually very little brain damage. <br/>
+ Disadvantage: the difference-of-means method is precisely targeted, while this method requires filtering for interesting control vectors from a selection of prompts.
+
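The projection step mentioned above can be sketched roughly as follows. This is a minimal NumPy illustration on random matrices, not the code used to produce these models. The function names, the toy shapes, and the assumption that the weight is stored as (d_model, d_ff), as in a PyTorch `nn.Linear`, are all illustrative. `difference_of_means` is included only to show the comparison method; in the approach described here, the direction would instead be one of the filtered control vectors.

```python
import numpy as np


def difference_of_means(acts_a, acts_b):
    """Comparison method: unit vector along the difference of mean activations.

    acts_a, acts_b: arrays of shape (n_prompts, d_model), hypothetical
    residual-stream activations for two contrasting prompt sets.
    """
    direction = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return direction / np.linalg.norm(direction)


def orthogonalize_down_proj(weight, direction):
    """Project `direction` out of the output space of a down_proj-style matrix.

    weight:    (d_model, d_ff), maps MLP hidden activations to the residual stream
    direction: (d_model,), the vector to remove (need not be normalized)

    Returns W' = (I - v v^T) W, so that v^T (W' x) = 0 for every input x.
    """
    v = direction / np.linalg.norm(direction)
    return weight - np.outer(v, v @ weight)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff = 8, 16  # toy sizes, far smaller than a real model

    weight = rng.normal(size=(d_model, d_ff))
    # Stand-in direction; a real run would use a vector found on actual prompts.
    direction = difference_of_means(
        rng.normal(size=(32, d_model)) + 1.0,
        rng.normal(size=(32, d_model)),
    )

    new_weight = orthogonalize_down_proj(weight, direction)

    # The edited layer can no longer write along `direction`.
    hidden = rng.normal(size=d_ff)
    print("component along direction before:", direction @ (weight @ hidden))
    print("component along direction after: ", direction @ (new_weight @ hidden))
```

Because only one matrix in one layer is changed, every other weight is left untouched, which is consistent with the limited side effects noted above.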
+ [https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v3](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v3) <br/>
+ [https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v4](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v4) <br/>
+ [https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v5](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v5) <br/>
+ [https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v6](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v6) <br/>
+ [https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v7](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v7) <br/>