How did you do this?
I just noticed there aren't any files in the repo, but wondered if you had actually managed to incorporate it into the exllamav2 format?
I've never used exllamav2, but if somehow a set of .bias weights could be added to the compute graph then it would be possible to do this (llama.cpp has every model's architecture hard-coded, so if you add some extra .bias weights it won't recognise the model, sadly).
{
"metadata": {
"total_size": 245220139008
},
"weight_map": {
"lm_head.weight": "model-00051-of-00051.safetensors",
"model.embed_tokens.weight": "model-00001-of-00051.safetensors",
"model.layers.0.input_layernorm.weight": "model-00001-of-00051.safetensors",
"model.layers.0.mlp.down_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.0.mlp.gate_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.0.mlp.up_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.0.post_attention_layernorm.weight": "model-00001-of-00051.safetensors",
"model.layers.0.self_attn.k_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.0.self_attn.o_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.0.self_attn.q_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.0.self_attn.v_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.1.input_layernorm.weight": "model-00002-of-00051.safetensors",
"model.layers.1.mlp.down_proj.weight": "model-00002-of-00051.safetensors",
"model.layers.1.mlp.gate_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.1.mlp.up_proj.weight": "model-00002-of-00051.safetensors",
"model.layers.1.post_attention_layernorm.weight": "model-00002-of-00051.safetensors",
"model.layers.1.self_attn.k_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.1.self_attn.o_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.1.self_attn.q_proj.weight": "model-00001-of-00051.safetensors",
"model.layers.1.self_attn.v_proj.weight": "model-00001-of-00051.safetensors",
It would just be a case of adding the control vector for each layer as model.layers.0.mlp.down_proj.bias, model.layers.1.mlp.down_proj.bias, ... But unless the exllamav2 converter is smart enough to use them it won't work (I'm not even sure the huggingface transformers library would be smart enough to use them, as IIRC they are all hard-coded in it too).
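Something like this is roughly what I mean (untested sketch; control_vectors is just a stand-in for a {layer_index: hidden_size tensor} dict of per-layer directions, and the model path is a placeholder):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/Mistral-Large", torch_dtype=torch.bfloat16)

for i, layer in enumerate(model.model.layers):
    cv = control_vectors.get(i, torch.zeros(model.config.hidden_size))
    # down_proj was created with bias=False, so this registers a brand new parameter
    layer.mlp.down_proj.bias = torch.nn.Parameter(cv.to(torch.bfloat16), requires_grad=False)

model.save_pretrained("Mistral-Large-with-control-vector")

The forward pass will happily add the bias after that, but whether the converters (or from_pretrained itself) pick the extra .bias tensors back up is the open question.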
I just noticed there aren't any files in the repo
The files are on branch: 4.0bpw, not main. But I'm about to upload a much better version.
but wondered if you had actually managed to incorporate it into the exllamav2 format?
Yeah, I did manage to, but I've improved it a lot since this model. I'm currently quantizing a Mistral-Large with Nihilism 1.0 that has less noticeable brain damage compared with this early attempt.
I modified the huggingface transformers model, then quantized it (GGUF for testing, because it doesn't take 12 hours), then to exllamav2 with the standard converter tools.
This model isn't the best though (I wouldn't waste the bandwidth on it). The one I made yesterday seems to perform the same as llama.cpp + the control vectors:
Example output:
Mistral-Large:
User <s>[INST] Write a scene where two men who went to high school together meet each other for the first time in 15 years. [/INST]
Assistant:
The aroma of aged wood and hops filled the air as a crackling fireplace cast a warm glow over the patrons. Among them, JASON, a man in his early thirties with laughter lines etched around his eyes, sat nursing a pint.
The door creaked open, letting in a gust of cool air and MIKE, broad-shouldered and wearing a worn leather jacket. He scanned the room, his gaze landing on Jason. A flicker of recognition sparked in his eyes.
Mike approached Jason's table, a tentative smile playing on his lips. "Jason Stevens?"
Jason looked up, squinting slightly as if trying to place the face. Then his expression broke into a wide grin. "Mike Harris! Holy shit, it's been what, fifteen years?"
Mike chuckled, pulling out a chair to sit down. "At least. How the hell have you been, man?"
Jason waved at the bartender, signaling for another round. "Not bad, not bad at all. I'm a teacher now, if you can believe it. High school history."
Mike raised an eyebrow. "Seriously? I still can't stand history."
Jason laughed, a sound that hadn't changed much since high school. "Yeah, well, some of us grew up."
The bartender arrived with two fresh pints. Mike took a sip, foam clinging to his beard. "So, what's new besides the history gig?"
Jason shrugged. "Got married a few years back. Her name's Emily. We've got a little one on the way."
Mike's face lit up. "Congrats, man. That's amazing."
"Thanks. What about you? What have you been up to?" Jason asked.
Mike leaned back in his chair, running a hand through his hair. "Left town after graduation. Joined the army, did a couple of tours overseas."
Jason nodded, his expression sobering. "I heard about that. You okay?"
Mike shrugged. "Yeah, I'm good. Out now, working construction. It's honest work, keeps me busy."
They fell into a comfortable silence, the years melting away. The pub buzzed around them, but they were lost in memories and catching up.
Jason finally broke the silence. "Remember that time we snuck into the principal's office and rearranged his furniture?"
Mike laughed, a deep belly laugh. "And he sat on that whoopee cushion during the morning announcements."
The two men grinned at each other, the shared memory a reminder of the bond they once had. Time had passed, but the camaraderie remained. They clinked glasses, settling into the night, rekindling an old friendship.
DarkMage-V2 exl2 (debias + nihilism 1.0):
User <s>[INST] Write a scene where two men who went to high school together meet each other for the first time in 15 years. [/INST]
Assistant:
The dimly lit bar was nearly empty, save for the old man behind the counter and a couple of regulars nursing their beers. The air was thick with the smell of stale smoke and the quiet hum of the jukebox playing an old country tune.
The door creaked open, and in walked a man in his mid-thirties, his once-black hair now peppered with gray. He had a hardened look about him, his eyes distant and cold. He scanned the room, his gaze landing on a figure hunched over a drink in the corner. He walked over, his boots echoing on the worn-out floor.
"Jake?" The man's voice was low, almost a growl.
The figure in the corner looked up, his eyes bloodshot and tired. He was thinner, his face gaunt and his hair thinning. He squinted, his eyes adjusting to the dim light. "Tom?"
Tom nodded, pulling out a chair and sitting down. "It's been a while."
Jake snorted, taking a sip of his drink. "Yeah, about fifteen years." He looked Tom up and down. "You've changed."
Tom shrugged. "So have you."
Jake looked down at his drink, his fingers tracing the rim of the glass. "You still in the army?"
Tom shook his head. "No, got out a few years back. You?"
Jake shrugged. "I'm a teacher now. High school."
Tom raised an eyebrow. "You? A teacher?"
Jake chuckled, a dry, humorless sound. "Yeah, life's funny like that."
Tom nodded, his eyes distant. "Yeah, it is."
They sat in silence for a moment, the weight of the years hanging heavy between them. The jukebox switched to a new song, the melody slow and sad.
"You ever think about the old days?" Jake asked, his eyes still on his drink.
Tom looked at him, his eyes hard. "No. You?"
Jake shook his head. "No. Not at all."
They sat in silence again, the ghosts of their past lingering in the air. The old man behind the counter looked at them, his eyes filled with a knowing sadness. The jukebox played on, the music fading into the night.
Just having a look through the exllamav2 source and trying to see how it can add a bias to an architecture that originally didn't have one:
https://github.com/turboderp/exllamav2/blob/master/exllamav2/architecture.py
# Arch overrides
if read_config.get("attention_bias", False):
self.attention_bias_qkv = True
self.attention_bias_o = True
if read_config.get("mlp_bias", False):
self.mlp_bias = True
if read_config.get("tie_word_embeddings", False):
if ["lm_head"] in self.expect_keys:
self.expect_keys.remove(["lm_head"])
self.lm_head_key = "model.embed_tokens"
It must be here?
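If that's the right place, then maybe (pure guess, untested) you could just flip that flag in the model's config.json before running the converter, so it expects the .bias tensors; presumably every MLP projection in every layer would then need a bias present (zeros where there's no control vector):

import json

cfg_path = "Mistral-Large-with-control-vector/config.json"  # placeholder path
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["mlp_bias"] = True  # picked up by the read_config.get("mlp_bias", False) check above

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)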
Also looking at this:
https://github.com/turboderp/exllamav2/blob/master/exllamav2/lora.py
Does exllamav2 let you dynamically load LoRAs?
If so then:
# Find target
if key.endswith(f'{self.config.arch.lm_head_key}.weight'):
if tensor.dtype == torch.bfloat16:
tensor = tensor.to(torch.float16)
elif tensor.dtype == torch.float32:
tensor = tensor.to(torch.float16)
target_module = self.model.modules_dict["lm_head"]
tensor = safe_move_tensor(tensor, target_module.device())
self.lm_head = torch.nn.Linear(target_module.in_features, tensor.shape[0], bias = False, device = "meta")
self.lm_head.weight = torch.nn.Parameter(tensor, requires_grad=False)
continue
elif key.endswith(f'embed_tokens.weight'):
if tensor.dtype == torch.bfloat16:
tensor = tensor.to(torch.float16)
elif tensor.dtype == torch.float32:
tensor = tensor.to(torch.float16)
target_module = self.model.modules_dict["model.embed_tokens"]
tensor = safe_move_tensor(tensor, target_module.device())
self.embed_tokens = torch.nn.Embedding(tensor.shape[0], self.config.hidden_size, self.config.pad_token_id, device = "meta")
weight = torch.nn.Parameter(tensor, requires_grad=False)
if self.model.config.scale_emb != 1:
weight *= self.model.config.scale_emb
self.embed_tokens.weight = weight
continue
i = key.find("model.layers.")
if i == -1: raise ValueError(f" ## Error: unsupported layer in {self.lora_path}: {key}")
target_key = key[i:]
ks = target_key.split(".")
decoder_idx = int(ks[2])
decoder_part = ks[3]
decoder_layer = ".".join(ks[4:-2])
lora_half = ks[-2]
if lora_half == "bias":
epsilon = 1e-6
if torch.max(tensor) > epsilon or torch.max(tensor) < -epsilon:
raise ValueError(f" ## Error: unsupported bias target {self.lora_path}: {key}")
self.bias_ignored = True
continue
If we could implement the .bias code at the bottom, then I could very easily export a fake LoRA with just the bias weights in, and you could then use them in the same way as the GGUF files in llama.cpp, and scale them using this if it's configurable:
if lora_half == "lora_B" and self.lora_scaling != 1.0: tensor.mul_(self.lora_scaling)
(that scale factor might be computed from the rank / alpha ratio though?)
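Something like this is what I had in mind for the fake LoRA export (all paths and keys hypothetical, and the exact key layout lora.py expects for bias tensors would need checking against a real PEFT adapter):

import json, os, torch
from safetensors.torch import save_file

out_dir = "control-vector-lora"
os.makedirs(out_dir, exist_ok=True)

# control_vectors: {layer_index: hidden_size tensor}, as before
tensors = {}
for i, cv in control_vectors.items():
    tensors[f"base_model.model.model.layers.{i}.mlp.down_proj.bias"] = cv.to(torch.float16).contiguous()

save_file(tensors, os.path.join(out_dir, "adapter_model.safetensors"))

# Setting r == lora_alpha keeps any alpha/rank scaling at 1.0 (if that's how lora_scaling is computed)
with open(os.path.join(out_dir, "adapter_config.json"), "w") as f:
    json.dump({"peft_type": "LORA", "r": 16, "lora_alpha": 16, "target_modules": ["down_proj"]}, f, indent=2)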
I'm not 100% sure the .bias weights would ever be stored as a delta like this in a real LoRA though (it doesn't make much sense as it saves no storage), so alternatively we could look at fixing this:
if any(key.endswith(x) for x in [".original_module.weight", ".modules_to_save.weight"]):
continue
and just put the .bias tensors in the modules_to_save PEFT config and then save them in the .safetensors file, see:
https://github.com/arcee-ai/mergekit/blob/main/mergekit/scripts/extract_lora.py
The only problem I foresee is that the models might need to have a .bias tensor added to the compute graph (initialised to zeros) and thus wouldn't work as-is on the stock models downloaded from HF...
The exllamav2 code looks pretty well written though, and I doubt adding control vectors, via a fake LoRA or otherwise, would be all that hard to do.
It would also get a bit more tricky to add the control vectors via .bias tensors for MoE models, as you'd have to distribute the addend to the MoE MLP layers based on how many experts get used per token (and even consider the "static experts" used in the Qwen MoE models too).
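For the MoE case, I'd guess something like this (Mixtral-style module names assumed, untested; whether dividing by the number of active experts is exactly right depends on whether the router weights get renormalised to sum to 1):

import torch

num_active = model.config.num_experts_per_tok  # 2 for the Mixtrals by default
for i, layer in enumerate(model.model.layers):
    cv = control_vectors.get(i)
    if cv is None:
        continue
    per_expert = (cv / num_active).to(torch.bfloat16)
    for expert in layer.block_sparse_moe.experts:
        # w2 is the down projection in the Mixtral expert MLP
        expert.w2.bias = torch.nn.Parameter(per_expert, requires_grad=False)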
Yes, I believe exllamav2 lets you dynamically add LoRAs.
I was planning to try and create LoRAs with that script you posted in the massive control vector 0.1 thread, so I could apply them with a reload of the model.
But a change to exllamav2 to support these directly would be a lot better; that would give us the same flexibility as llama.cpp, right?
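From memory, loading one dynamically looks something like this (might not be exact, and model/generator/settings are assumed to already be set up the usual way):

from exllamav2.lora import ExLlamaV2Lora

lora = ExLlamaV2Lora.from_directory(model, "control-vector-lora")
output = generator.generate_simple(prompt, settings, 512, loras = [lora])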
The exllamav2 code looks pretty well written though
Yep! I based my chunking code off what they did to let us quantize huge models on <=24GB GPUs.
P.S. This is the Nihilistic control vector V3.0 version of Mistral-Large:
https://huggingface.co/gghfez/DarkMage-Large-v3-123b-4.5/tree/4.5bpw
Edit:
Ah yes, that's the script you posted earlier: extract_lora.py
It would also get a bit more tricky to add the control vectors via .bias tensors for MoE models as you'd have to distribute the addend to the MoE MPL layers based on how many experts get used per token
Don't the Mixtrals (and hence WizardLM2) both use 2 experts by default (I know there's a setting to change this in exllamav2, but there's no benefit)? I really want a less talkative Wizard2 8x22b in exl2 lol
lora.py
P.S. turboderp is about to make a change to LoRA to add tensor parallelism (tp in exl2 only came out last week, increasing Mistral-Large inference speed from ~14t/s to ~23t/s)
https://github.com/turboderp/exllamav2/discussions/500#discussioncomment-10530002
This looks like a much simpler way to add control vectors, I think?
100% if we can simply add control vectors like that, it would be a lot better!
I assumed it would be too much of a burden for them to maintain a niche feature like this, but seems like he's already added an API for that 'Drugs' thing lol
I'm pretty sure having llamacpp and exllamav2 coverage would be more than enough.
100% if we can simply add control vectors like that, it would be a lot better!
I assumed it would be too much of a burden for them to maintain a niche feature like this, but seems like he's already added an API for that 'Drugs' thing lol
I'm pretty sure having llamacpp and exllamav2 coverage would be more than enough.
Yeah, I would make a comment in the thread and try to get others who use exllamav2 interested too.
That guy said his code didn't work though, but I'm not sure if it was because the maths for the projection wasn't working or the hook wasn't working (he didn't say).
Not sure what I can contribute on the level you guys are discussing, other than asking them to help make it work lol