---
license: other
tags:
- moe
- merge
- mergekit
- Mistral
- openchat/openchat-3.5-1210
- beowolx/CodeNinja-1.0-OpenChat-7B
- maywell/PiVoT-0.1-Starling-LM-RP
- WizardLM/WizardMath-7B-V1.1
license_name: microsoft-research-license
license_link: https://huggingface.co/WizardLM/WizardMath-7B-V1.1/resolve/main/LICENSE
model-index:
- name: Beyonder-4x7B-v2
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 68.77
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 86.8
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.1
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 60.68
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 80.9
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 71.72
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/Beyonder-4x7B-v2
      name: Open LLM Leaderboard
---
# Beyonder-4x7B-v2

This model is a Mixture of Experts (MoE) made with mergekit (mixtral branch). It uses the following base models:

- [openchat/openchat-3.5-1210](https://huggingface.co/openchat/openchat-3.5-1210)
- [beowolx/CodeNinja-1.0-OpenChat-7B](https://huggingface.co/beowolx/CodeNinja-1.0-OpenChat-7B)
- [maywell/PiVoT-0.1-Starling-LM-RP](https://huggingface.co/maywell/PiVoT-0.1-Starling-LM-RP)
- [WizardLM/WizardMath-7B-V1.1](https://huggingface.co/WizardLM/WizardMath-7B-V1.1)

The recommended context length is 8k.
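Because the merge follows the Mixtral MoE architecture, the expert layout and context window can be sanity-checked directly from the model configuration. A minimal sketch, assuming the repository exposes a standard Mixtral-style `config.json`:

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the MoE layout.
config = AutoConfig.from_pretrained("mlabonne/Beyonder-4x7B-v2")

# Mixtral-style configs expose the expert layout and context window directly.
print(config.model_type)               # expected: "mixtral"
print(config.num_local_experts)        # expected: 4 experts
print(config.num_experts_per_tok)      # experts routed per token (typically 2)
print(config.max_position_embeddings)  # configured context window
```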
## ⚡ Quantized models

Thanks to TheBloke for the GGUF, AWQ, and GPTQ quants, and to bartowski for the EXL2 version (a loading sketch for the GGUF files follows the list):

- **GGUF**: https://huggingface.co/TheBloke/Beyonder-4x7B-v2-GGUF
- **AWQ**: https://huggingface.co/TheBloke/Beyonder-4x7B-v2-AWQ
- **GPTQ**: https://huggingface.co/TheBloke/Beyonder-4x7B-v2-GPTQ
- **EXL2**: https://huggingface.co/bartowski/Beyonder-4x7B-v2-exl2
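For CPU or low-VRAM setups, the GGUF quants can be run with llama-cpp-python. A minimal sketch, assuming you have downloaded one of the files from the GGUF repo (the exact filename, e.g. `beyonder-4x7b-v2.Q4_K_M.gguf`, depends on the quant you pick):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Path to a downloaded GGUF quant; the filename below is only an example.
llm = Llama(
    model_path="./beyonder-4x7b-v2.Q4_K_M.gguf",
    n_ctx=8192,       # matches the recommended 8k context
    n_gpu_layers=-1,  # offload all layers to GPU if one is available
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a haiku about mixtures of experts."}],
    max_tokens=128,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```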
## 🏆 Evaluation

Beyonder-4x7B-v2 is competitive with Mixtral-8x7B-Instruct-v0.1 on the Open LLM Leaderboard, while having only 4 experts instead of 8. It also shows a significant improvement over each of its individual experts.

It also performs very well on the Nous benchmark suite compared to other models, coming close to the best Yi-34B fine-tune despite being much smaller: 24.2B total parameters, of which only two experts (~12B parameters) are active per token at inference time, versus 34B parameters.
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| **Beyonder-4x7B-v2** | 45.29 | 75.95 | 60.86 | 46.4 | 57.13 |
| NeuralHermes-2.5-Mistral-7B | 43.67 | 73.24 | 55.37 | 41.76 | 53.51 |
| OpenHermes-2.5-Mistral-7B | 42.75 | 72.99 | 52.99 | 40.94 | 52.42 |
| Nous-Hermes-2-SOLAR-10.7B | 47.79 | 74.69 | 55.92 | 44.84 | 55.81 |
| Nous-Hermes-2-Yi-34B | 50.27 | 76.00 | 60.34 | 46.69 | 58.33 |
### AGIEval

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 23.62 | ± | 2.67 |
| | | acc_norm | 23.62 | ± | 2.67 |
| agieval_logiqa_en | 0 | acc | 41.47 | ± | 1.93 |
| | | acc_norm | 43.01 | ± | 1.94 |
| agieval_lsat_ar | 0 | acc | 23.04 | ± | 2.78 |
| | | acc_norm | 23.48 | ± | 2.80 |
| agieval_lsat_lr | 0 | acc | 51.57 | ± | 2.22 |
| | | acc_norm | 52.94 | ± | 2.21 |
| agieval_lsat_rc | 0 | acc | 64.31 | ± | 2.93 |
| | | acc_norm | 64.68 | ± | 2.92 |
| agieval_sat_en | 0 | acc | 79.13 | ± | 2.84 |
| | | acc_norm | 79.13 | ± | 2.84 |
| agieval_sat_en_without_passage | 0 | acc | 43.20 | ± | 3.46 |
| | | acc_norm | 43.20 | ± | 3.46 |
| agieval_sat_math | 0 | acc | 34.55 | ± | 3.21 |
| | | acc_norm | 32.27 | ± | 3.16 |
### GPT4All

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 61.86 | ± | 1.42 |
| | | acc_norm | 64.51 | ± | 1.40 |
| arc_easy | 0 | acc | 85.06 | ± | 0.73 |
| | | acc_norm | 82.45 | ± | 0.78 |
| boolq | 1 | acc | 88.35 | ± | 0.56 |
| hellaswag | 0 | acc | 68.04 | ± | 0.47 |
| | | acc_norm | 85.12 | ± | 0.36 |
| openbookqa | 0 | acc | 37.80 | ± | 2.17 |
| | | acc_norm | 48.60 | ± | 2.24 |
| piqa | 0 | acc | 83.08 | ± | 0.87 |
| | | acc_norm | 83.95 | ± | 0.86 |
| winogrande | 0 | acc | 78.69 | ± | 1.15 |
### TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 44.55 | ± | 1.74 |
| | | mc2 | 60.86 | ± | 1.57 |
### Bigbench

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 58.95 | ± | 3.58 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 66.40 | ± | 2.46 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 48.84 | ± | 3.12 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 22.56 | ± | 2.21 |
| | | exact_str_match | 13.37 | ± | 1.80 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 30.40 | ± | 2.06 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 20.57 | ± | 1.53 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 52.00 | ± | 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 44.40 | ± | 2.22 |
| bigbench_navigate | 0 | multiple_choice_grade | 52.10 | ± | 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 69.75 | ± | 1.03 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 55.36 | ± | 2.35 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 23.65 | ± | 1.35 |
| bigbench_snarks | 0 | multiple_choice_grade | 77.35 | ± | 3.12 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 73.02 | ± | 1.41 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 46.80 | ± | 1.58 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.08 | ± | 1.17 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 19.03 | ± | 0.94 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 52.00 | ± | 2.89 |
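The per-task tables above are in the output format of EleutherAI's lm-evaluation-harness, which the Nous benchmark suite builds on. Purely as an illustration of how such a table is produced (not the exact harness fork, task list, or settings behind these numbers), a single task can be scored through the harness's Python API:

```python
# pip install lm-eval
from lm_eval import simple_evaluate

# Illustrative run of one GPT4All-style task; task names and few-shot settings
# may differ from the exact Nous benchmark configuration used above.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=mlabonne/Beyonder-4x7B-v2,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=0,
    batch_size="auto",
)
print(results["results"]["arc_challenge"])
```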
## 🧩 Configuration

```yaml
base_model: mlabonne/Marcoro14-7B-slerp
experts:
  - source_model: openchat/openchat-3.5-1210
    positive_prompts:
      - "chat"
      - "assistant"
      - "tell me"
      - "explain"
  - source_model: beowolx/CodeNinja-1.0-OpenChat-7B
    positive_prompts:
      - "code"
      - "python"
      - "javascript"
      - "programming"
      - "algorithm"
  - source_model: maywell/PiVoT-0.1-Starling-LM-RP
    positive_prompts:
      - "storywriting"
      - "write"
      - "scene"
      - "story"
      - "character"
  - source_model: WizardLM/WizardMath-7B-V1.1
    positive_prompts:
      - "reason"
      - "math"
      - "mathematics"
      - "solve"
      - "count"
```
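The `positive_prompts` are used by mergekit's MoE merge to initialize each expert's router weights, so that tokens resembling an expert's specialty get routed to it. The toy sketch below only illustrates the general idea (embedding each expert's prompts with the base model listed above and averaging the hidden states into one routing vector per expert); it is a conceptual illustration under those assumptions, not mergekit's actual implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Conceptual illustration only: derive one routing vector per expert from its
# positive prompts, as a stand-in for prompt-based gate initialization.
base_id = "mlabonne/Marcoro14-7B-slerp"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModel.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")

expert_prompts = {
    "chat": ["chat", "assistant", "tell me", "explain"],
    "code": ["code", "python", "javascript", "programming", "algorithm"],
    "story": ["storywriting", "write", "scene", "story", "character"],
    "math": ["reason", "math", "mathematics", "solve", "count"],
}

@torch.no_grad()
def prompt_embedding(text: str) -> torch.Tensor:
    # Mean-pool the base model's last hidden state for one prompt.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)

# Stack one averaged vector per expert -> a (num_experts, hidden_size) matrix
# that a router/gate layer could be initialized from.
gate_init = torch.stack([
    torch.stack([prompt_embedding(p) for p in prompts]).mean(dim=0)
    for prompts in expert_prompts.values()
])
print(gate_init.shape)  # (4, hidden_size)
```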
## 💻 Usage

Here's a notebook to run this model in 4-bit precision using a free T4 GPU on Google Colab.

```python
!pip install -qU transformers bitsandbytes accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "mlabonne/Beyonder-4x7B-v2"

# Build a text-generation pipeline that loads the model in 4-bit precision.
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.float16, "load_in_4bit": True},
)

# Format the conversation with the model's chat template before generating.
messages = [{"role": "user", "content": "Explain what a Mixture of Experts is in less than 100 words."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
Output:
A Mixture of Experts (ME) is a machine learning technique that combines multiple expert models to make predictions or decisions. Each expert model is specialized in a different aspect of the problem, and their outputs are combined to produce a more accurate and robust solution. This approach allows the model to leverage the strengths of individual experts and compensate for their weaknesses, improving overall performance.
## Open LLM Leaderboard Evaluation Results

Detailed results can be found here.

| Metric | Value |
|---|---|
| Avg. | 72.33 |
| AI2 Reasoning Challenge (25-Shot) | 68.77 |
| HellaSwag (10-Shot) | 86.80 |
| MMLU (5-Shot) | 65.10 |
| TruthfulQA (0-shot) | 60.68 |
| Winogrande (5-shot) | 80.90 |
| GSM8k (5-shot) | 71.72 |