lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v6

lodrick-the-lafted committed on Jul 26

Commit d6e7a50 • 1 Parent(s): 3341e27

Create README.md

Files changed (1)

README.md +13 -0

README.md ADDED

	@@ -0,0 +1,13 @@

+A few different attempts at orthogonalization/abliteration of llama-3.1-8b-instruct, using variations of the method laid out in "Mechanistically Eliciting Latent Behaviors in Language Models". <br/>
+v1 & v2 were destined for the bit bucket <br/>
+<br/>
+Each of these uses a different vector, and they vary somewhat in where the new refusal boundaries lie. None of them seems totally jailbroken. <br/>
+Advantage: only the down_proj of one layer needs to be projected, so there is usually very little brain damage. <br/>
+Disadvantage: the difference-of-means method is precisely targeted, while this method requires filtering for interesting control vectors from a selection of prompts. <br/>
+[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v3](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v3) <br/>
+[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v4](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v4) <br/>
+[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v5](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v5) <br/>
+[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v6](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v6) <br/>
+[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v7](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v7) <br/>
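
The README says that only one layer's down_proj is projected. A minimal NumPy sketch of what such a projection can look like follows. It is not the author's code: the function name `project_out`, the toy dimensions, and the random `control_vector` are all hypothetical, and the only assumption taken from practice is that the weight is stored as `(d_model, d_ff)`, the layout of a PyTorch linear layer mapping MLP activations into the residual stream. How the control vector is found (the filtering step the README mentions) is not shown.

```python
import numpy as np


def project_out(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Return a copy of `weight` whose outputs have no component along `direction`.

    weight:    (d_model, d_ff) matrix, applied as y = weight @ x.
    direction: (d_model,) vector to remove from everything the layer writes.
    """
    unit = direction / np.linalg.norm(direction)
    # W' = (I - v v^T) W, computed without building the d_model x d_model matrix.
    return weight - np.outer(unit, unit @ weight)


# Toy stand-ins (hypothetical sizes and random values, not real model weights).
rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
down_proj = rng.normal(size=(d_model, d_ff))
control_vector = rng.normal(size=d_model)

ablated = project_out(down_proj, control_vector)

# Whatever the MLP activations are, the layer can no longer write along the vector.
mlp_activations = rng.normal(size=d_ff)
leak = float(control_vector @ (ablated @ mlp_activations))
```

In this sketch `leak` is zero up to floating-point error, and applying `project_out` a second time leaves the matrix unchanged, which is what makes it a projection. Editing a single matrix this way leaves every other weight in the model untouched, consistent with the "very little brain damage" remark above.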