Commit d6e7a50 by lodrick-the-lafted (1 parent: 3341e27)
Create README.md

README.md ADDED
@@ -0,0 +1,13 @@
A few different attempts at orthogonalization/abliteration of llama-3.1-8b-instruct using variations of the method laid out in "Mechanistically Eliciting Latent Behaviors in Language Models". <br/>
v1 & v2 were destined for the bit bucket <br/>
<br/>
Each of these uses a different vector, so the new refusal boundaries fall in slightly different places. None of them seems totally jailbroken.

Advantage: only down_proj for one layer needs to be projected, so there is usually very little brain damage. <br/>
Disadvantage: the difference of means method is precisely targeted, while this method requires filtering for interesting control vectors from a selection of prompts. <br/>
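For readers unfamiliar with the two ideas above, here is a rough sketch, not the code used to build these models. It assumes a control vector has already been found somehow, and shows (a) how a difference-of-means direction is computed from two sets of activations and (b) how a direction can be projected out of a single weight matrix laid out as [hidden][intermediate], which is how a down_proj Linear weight is normally stored. All function names and numbers are made up for illustration, and plain lists are used only to keep it dependency-free; real code would operate on torch tensors.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def difference_of_means(acts_a, acts_b):
    """Difference-of-means direction: mean(acts_a) - mean(acts_b), per hidden dim."""
    dim = len(acts_a[0])
    mean_a = [sum(row[i] for row in acts_a) / len(acts_a) for i in range(dim)]
    mean_b = [sum(row[i] for row in acts_b) / len(acts_b) for i in range(dim)]
    return [a - b for a, b in zip(mean_a, mean_b)]


def project_out(weight, direction):
    """Return W' = W - v (v^T W), so that W' x has no component along v.

    weight is [hidden][intermediate]; direction is a [hidden] vector.
    """
    v = unit(direction)
    n_in = len(weight[0])
    # v^T W: how strongly each input column writes along v.
    coeffs = [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(n_in)]
    return [[weight[i][j] - v[i] * coeffs[j] for j in range(n_in)]
            for i in range(len(v))]


def matvec(weight, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Toy numbers only: 3 hidden dims, 2 intermediate dims.
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
v = [1.0, 1.0, 0.0]  # hypothetical control vector, assumed given
W_ortho = project_out(W, v)
out = matvec(W_ortho, [0.7, -0.2])
# Component of the layer output along v; zero up to float error.
leak = sum(a * b for a, b in zip(unit(v), out))
```

Because only one matrix in one layer is touched, every other weight is left exactly as it was, which is presumably why the side effects tend to be small. <br/>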
[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v3](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v3) <br/>
[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v4](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v4) <br/>
[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v5](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v5) <br/>
[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v6](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v6) <br/>
[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v7](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v7) <br/>