---
license: llama3
library_name: transformers
tags:
- nsfw
- not-for-all-audiences
- llama-3
- text-generation-inference
- moe
- mergekit
- merge
---

# Llama-Salad-4x8B

This is a MoE merge of several Llama-3 models that aims to create a solid role-play model without sacrificing logic and reasoning. Meta-Llama-3-8B-Instruct was already quite good at role-play, but the overuse of `*wink*` and `*giggle*`, along with the tiptoeing around NSFW topics by spamming ellipses, really started to piss me off. L3-8B-Stheno-v3.1 is definitely the best role-play model out of all the Llama-3 fine-tunes so far; however, it's a little *too* horny for my liking, it has a hard time keeping track of what's going on (like all other role-play fine-tunes), and it has a serious issue with over-responding: there's a high chance you'll get a 4+ paragraph response to something basic.

Combining the two models kept both of their strengths and none of their weaknesses; it completely flattened out the over-responding issue, improved the logic and reasoning, and toned down the horny levels a bit. Don't be mistaken, though: this hasn't affected its ability to generate NSFW content; it's just no longer overwhelmingly horny. It's the first model I've used, aside from base L3-8B-Stheno-v3.1, that actually treats sexual content like it's normal instead of some esoteric and shameful thing.

While role-play was the main focus of this merge, its base capabilities weren't affected at all, so there's no need to swap models for other tasks unless you require a bigger model. Actually, with the addition of Tess-2.0-Llama-3-8B, I did find a small overall improvement. There isn't any particular reason Llama3-OpenBioLLM-8B is in the merge; I needed a fourth model, and it seemed like a decent fine-tune. Upon testing Llama3-OpenBioLLM-8B after the fact, I've come to the conclusion that it's actually quite bad, and if I do make a V2, it will be removed.

Unfortunately, I can't compare it with 70B models because they're too slow on my machine, but this is the best sub-70B model I've used so far; I haven't felt the need to regenerate any responses, which hasn't happened with any other model. This is my first attempt at any kind of merge, and I want to share what I've learned, but this section is already longer than I wanted, so I've placed the rest at the bottom of the page.

# Quantization Formats
**GGUF**
- Static:
  - https://huggingface.co/HiroseKoichi/Llama-Salad-4x8B-GGUF
  - https://huggingface.co/mradermacher/Llama-Salad-4x8B-GGUF
- Imatrix:
  - https://huggingface.co/mradermacher/Llama-Salad-4x8B-i1-GGUF

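If you want to run one of the GGUF quants locally, here's a minimal sketch using `llama-cpp-python`; the filename below is just a placeholder for whichever quant you download from the repos above.

```python
# Minimal sketch (assumes llama-cpp-python is installed and a quant file has been
# downloaded from one of the repos above; the filename here is a placeholder).
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-Salad-4x8B.Q6_K.gguf",  # placeholder local filename
    n_ctx=8192,                               # matches the model's 8K context
    n_gpu_layers=-1,                          # offload all layers if you have the VRAM
)

# The merge uses the llama-3 instruct format; create_chat_completion applies the
# chat template stored in the GGUF metadata.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Introduce yourself in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```
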
# Details
- **License**: [llama3](https://llama.meta.com/llama3/license/)
- **Instruct Format**: [llama-3](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/)
- **Context Size**: 8K

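For the unquantized weights, here's a minimal sketch with `transformers`, assuming the repo id `HiroseKoichi/Llama-Salad-4x8B` (inferred from the GGUF links above); the chat template handles the llama-3 instruct format.

```python
# Minimal sketch; the repo id is an assumption based on the GGUF links above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiroseKoichi/Llama-Salad-4x8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# apply_chat_template builds the llama-3 instruct prompt for us.
messages = [{"role": "user", "content": "Write a short scene-setting paragraph for a tavern role-play."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
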
## Models Used
- [Meta-Llama-3-8B-Instruct](https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct)
- [Tess-2.0-Llama-3-8B](https://huggingface.co/migtissera/Tess-2.0-Llama-3-8B)
- [L3-8B-Stheno-v3.1](https://huggingface.co/Sao10K/L3-8B-Stheno-v3.1)
- [Llama3-OpenBioLLM-8B](https://huggingface.co/aaditya/Llama3-OpenBioLLM-8B)

## Merge Config
```yaml
base_model: NousResearch/Meta-Llama-3-8B-Instruct
gate_mode: hidden
dtype: bfloat16
experts_per_token: 2
experts:
  - source_model: NousResearch/Meta-Llama-3-8B-Instruct
    positive_prompts:
      - "summarize"
      - "explain"
      - "define"
      - "translate"
      - "multilingual"
      - "chat"
      - "conversation"
  - source_model: migtissera/Tess-2.0-Llama-3-8B
    positive_prompts:
      - "programming language"
      - "math"
      - "code"
      - "step-by-step"
      - "logic"
  - source_model: Sao10K/L3-8B-Stheno-v3.1
    positive_prompts:
      - "role-play"
      - "characters"
      - "narration"
      - "story writing"
      - "scene"
  - source_model: aaditya/Llama3-OpenBioLLM-8B
    positive_prompts:
      - "anatomy"
      - "diagnosis"
      - "symptom"
      - "biomedical"
      - "health"
      - "medicine"
      - "medication"
      - "physiology"
```

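If you want to reproduce the merge, here's a rough sketch of how a config like the one above is typically fed to mergekit's MoE script; the file and directory names are placeholders, and the exact invocation may vary with your mergekit version.

```python
# Rough sketch: save the YAML above to disk, then run mergekit's MoE entry point.
# File and directory names are placeholders, not part of the original card.
import subprocess

subprocess.run(
    [
        "mergekit-moe",           # MoE merge script installed with mergekit
        "llama-salad-4x8b.yaml",  # the config shown above, saved locally
        "./Llama-Salad-4x8B",     # output directory for the merged model
    ],
    check=True,
)
```
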
# What I Have Learned
After testing over a hundred different configurations, I have concluded that the name "Mixture of Experts" is misleading for merges; despite the name, using only domain-specific models does not work very well. When you fine-tune a model for a specific task, it degrades its ability to do other tasks; only using models of a specific domain makes that degradation more prominent. Pairing a domain-specific model with a general-purpose model fixes any degradation caused by fine-tuning and allows the model to be used outside of its specific domain.

As I said above, my testing of Llama3-OpenBioLLM-8B concluded that it's a terrible model. It doesn't follow instructions, it goes on incoherent rants, and it even crashes llama.cpp because it occasionally outputs an invalid token; even using their exact system prompt and example questions only resulted in a barely usable model. The addition of a general-purpose model, Meta-Llama-3-8B-Instruct, completely healed the damage and resolved the crashes, but using their system prompt and examples still only resulted in barely passable output, and it was impossible to leverage its training without doing so. While the model's addition hasn't necessarily improved the MoE merge, it at the very least didn't degrade it.

When using only domain-specific models in a MoE merge, each model will have degradation from fine-tuning; even if the current task manages to have perfect overlap with two models of the same domain, combining them will not fix that degradation. Likewise, if a task requires two models of different domains at once, it can leverage the strengths of both, but it cannot fix the degradation. It would be impossible to train a set of domain-specific models that cover every single use case in order to fix the degradation across the board, but a general-purpose model can do so easily.

Mixture of Experts could benefit from an architecture that lets you specify one expert that is always chosen, no matter what; token routing would still work the same way to choose the other expert, but you would now be able to guarantee that it gets paired with a general-purpose model. Llama-Salad-4x8B is proof that this is worthwhile; the addition of a model trained specifically for role-play, plus a broken model, has not degraded its base capabilities and has even fixed the problems with them.

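To make the idea concrete, here's a toy sketch I put together (my own illustration, not an existing mergekit or transformers feature): the general-purpose expert is always active, and the router only decides which specialist joins it for each token.

```python
# Toy illustration of the "always-on generalist + routed specialist" idea described
# above; not how current MoE layers are implemented, just a sketch of the concept.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, generalist, specialists, router_weights):
    """Always run the generalist; route to exactly one specialist per token."""
    gates = softmax(router_weights @ x)   # one gate value per specialist
    top = int(np.argmax(gates))           # standard top-1 routing over the specialists
    # Mixing weights are arbitrary here; a real MoE layer would normalize gate values.
    return generalist(x) + gates[top] * specialists[top](x)

# Tiny fake "experts": random linear maps over a 4-dim hidden state.
rng = np.random.default_rng(0)
hidden = 4
generalist = lambda x, W=rng.normal(size=(hidden, hidden)): W @ x
specialists = [lambda x, W=rng.normal(size=(hidden, hidden)): W @ x for _ in range(3)]
router_weights = rng.normal(size=(3, hidden))

print(moe_forward(rng.normal(size=hidden), generalist, specialists, router_weights))
```
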
I'm just some random guy, and these are just my own observations from my experiments, so I very well could be wrong, but it's definitely worth further research considering how well Llama-Salad-4x8B turned out.