lodrick-the-lafted / llama-3.1-8b-instruct-ortho-v4

Update README.md

49dffec verified 4 months ago


	---
	license: wtfpl
	---
	A few different attempts at orthogonalization/abliteration of llama-3.1-8b-instruct using variations of the method from "Mechanistically Eliciting Latent Behaviors in Language Models". <br/>
	v1 & v2 were destined for the bit bucket <br/>
	<br/>
	Each of these uses a different vector, and they vary somewhat in where the new refusal boundaries lie. None of them seems totally jailbroken.

	Advantage: only down_proj in a single layer needs to be altered, so there is usually very little brain damage. <br/>
	Disadvantage: the difference-of-means method is precisely targeted, whereas this method requires filtering a selection of prompts for interesting control vectors.
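
For anyone unfamiliar with the two ideas being compared, here is a minimal NumPy sketch of (a) a difference-of-means direction and (b) orthogonalizing a single `down_proj`-shaped weight matrix against a direction. It is not the procedure used to build these models: the function names, the toy sizes, and the synthetic activations are all made up for illustration, and the weight is assumed to be laid out as `(d_model, d_ff)`, the way a PyTorch `nn.Linear` stores it. How the control vectors were actually found and filtered is not shown here.

```python
import numpy as np


def difference_of_means(acts_a, acts_b):
    """Unit vector pointing from the mean of acts_b to the mean of acts_a.

    Both inputs are (n_samples, d_model) arrays of residual-stream activations.
    """
    direction = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return direction / np.linalg.norm(direction)


def orthogonalize_output(weight, direction):
    """Return a copy of `weight` whose outputs have no component along `direction`.

    weight:    (d_model, d_ff), hypothetical stand-in for one layer's down_proj
    direction: (d_model,) vector in the residual stream
    Computes W' = (I - v v^T) W for the normalized direction v.
    """
    v = direction / np.linalg.norm(direction)
    return weight - np.outer(v, v @ weight)


# Toy data only: random activations with an artificial offset on axis 0.
rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
refused = rng.normal(size=(32, d_model)) + 2.0 * np.eye(d_model)[0]
complied = rng.normal(size=(32, d_model))

v = difference_of_means(refused, complied)
down_proj = rng.normal(size=(d_model, d_ff))
ablated = orthogonalize_output(down_proj, v)

# Whatever this layer writes into the residual stream is now orthogonal to v.
hidden = rng.normal(size=d_ff)
print(abs(v @ (ablated @ hidden)))
```

Because only one matrix changes and everything it writes orthogonal to `v` is left untouched, the edit is small, which fits the "very little brain damage" observation above.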

	[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v3](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v3) <br/>
	[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v4](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v4) <br/>
	[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v5](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v5) <br/>
	[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v6](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v6) <br/>
	[https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v7](https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v7) <br/>