---
license: cc
library_name: transformers
model-index:
- name: SOLAR-10.7b-Instruct-truthy-dpo
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 72.1
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 88.44
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.45
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 76.75
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 82.72
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 59.21
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
      name: Open LLM Leaderboard
---
# SOLAR-10.7b-Instruct-truthy-dpo
This model is a further DPO finetune in the SOLAR-10.7B-Instruct line; the process below describes how it was trained.
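For quick use, here is a minimal inference sketch with transformers (the library named in the metadata). The prompt, sampling settings, and hardware assumptions are illustrative, not the author's recommended configuration, and it assumes the tokenizer ships a chat template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with enough memory for a 10.7B model in bf16
    device_map="auto",
)

# Build the prompt with the tokenizer's chat template
# (the benchmark run reported below used the ChatML prompt format).
messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```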
## Process
- I finetuned upstageai/Solar-10.7b-Instruct-v0.1 with 1 epoch of Intel/orca_dpo_pairs (12.4k samples).
- I further finetuned that model with 3 epochs of jondurbin/truthy-dpo-v0.1 (1.04k samples); a rough sketch of this DPO setup is shown after this list.
- This process is experimental, and the base model is more thoroughly tested at this time.
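The card does not publish the actual training configuration, but the two DPO stages described above can be sketched with TRL's `DPOTrainer` roughly as follows. Hyperparameters, dtype, and column handling are assumptions, and the API shown is that of TRL around v0.9-0.11.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "upstageai/Solar-10.7b-Instruct-v0.1"  # repo id as written in the card; verify the exact spelling
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# Stage 1 data: Intel/orca_dpo_pairs (~12.4k pairs). DPOTrainer expects
# "prompt"/"chosen"/"rejected" columns; this dataset stores the prompt as "question".
# Adjust the renames/drops to the dataset's actual columns.
ds = load_dataset("Intel/orca_dpo_pairs", split="train")
ds = ds.rename_column("question", "prompt").remove_columns(["system"])

args = DPOConfig(
    output_dir="solar-dpo-stage1",
    num_train_epochs=1,             # 1 epoch for stage 1, per the notes above
    per_device_train_batch_size=1,  # illustrative; actual values are not published
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,
)
trainer = DPOTrainer(model=model, ref_model=None, args=args,
                     train_dataset=ds, tokenizer=tokenizer)
trainer.train()

# Stage 2 would repeat the same loop on jondurbin/truthy-dpo-v0.1 (~1.04k pairs)
# for 3 epochs, starting from the stage-1 checkpoint.
```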
## GGUF
Available here
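If you use the GGUF export, a hypothetical sketch with llama-cpp-python might look like the following. The filename and context size are placeholders, not values confirmed by this card; check the linked GGUF repo for the actual files.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="solar-10.7b-instruct-truthy-dpo.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers if llama.cpp was built with GPU support
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Name one habit that improves factual accuracy."}],
    max_tokens=256,
)
print(result["choices"][0]["message"]["content"])
```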
## Evaluations

```
----Benchmark Complete----
2024-01-26 20:57:38
Time taken: 25.4 mins
Prompt Format: ChatML
Model: macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo-GGUF
Score (v2): 74.11
Parseable: 171.0
Batch completed
Time taken: 25.5 mins
```

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| SOLAR-10.7b-Instruct-truthy-dpo | 48.69 | 73.82 | 76.81 | 45.71 | 61.26 |
### AGIEval

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 27.95 | ± | 2.82 |
| | | acc_norm | 27.95 | ± | 2.82 |
| agieval_logiqa_en | 0 | acc | 42.40 | ± | 1.94 |
| | | acc_norm | 42.24 | ± | 1.94 |
| agieval_lsat_ar | 0 | acc | 25.65 | ± | 2.89 |
| | | acc_norm | 23.91 | ± | 2.82 |
| agieval_lsat_lr | 0 | acc | 54.12 | ± | 2.21 |
| | | acc_norm | 54.51 | ± | 2.21 |
| agieval_lsat_rc | 0 | acc | 69.89 | ± | 2.80 |
| | | acc_norm | 69.89 | ± | 2.80 |
| agieval_sat_en | 0 | acc | 80.10 | ± | 2.79 |
| | | acc_norm | 80.10 | ± | 2.79 |
| agieval_sat_en_without_passage | 0 | acc | 50.00 | ± | 3.49 |
| | | acc_norm | 49.51 | ± | 3.49 |
| agieval_sat_math | 0 | acc | 42.27 | ± | 3.34 |
| | | acc_norm | 41.36 | ± | 3.33 |
Average: 48.69%
### GPT4All

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 59.90 | ± | 1.43 |
| | | acc_norm | 63.91 | ± | 1.40 |
| arc_easy | 0 | acc | 80.85 | ± | 0.81 |
| | | acc_norm | 78.16 | ± | 0.85 |
| boolq | 1 | acc | 88.20 | ± | 0.56 |
| hellaswag | 0 | acc | 68.34 | ± | 0.46 |
| | | acc_norm | 86.39 | ± | 0.34 |
| openbookqa | 0 | acc | 37.60 | ± | 2.17 |
| | | acc_norm | 46.80 | ± | 2.23 |
| piqa | 0 | acc | 78.84 | ± | 0.95 |
| | | acc_norm | 78.78 | ± | 0.95 |
| winogrande | 0 | acc | 74.51 | ± | 1.22 |
Average: 73.82%
### TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 61.81 | ± | 1.70 |
| | | mc2 | 76.81 | ± | 1.42 |
Average: 76.81%
### Bigbench

| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 50.53 | ± | 3.64 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 63.14 | ± | 2.51 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 47.67 | ± | 3.12 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 26.18 | ± | 2.32 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 28.60 | ± | 2.02 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 21.29 | ± | 1.55 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 47.33 | ± | 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 39.80 | ± | 2.19 |
| bigbench_navigate | 0 | multiple_choice_grade | 63.80 | ± | 1.52 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 59.05 | ± | 1.10 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 40.18 | ± | 2.32 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 46.69 | ± | 1.58 |
| bigbench_snarks | 0 | multiple_choice_grade | 65.19 | ± | 3.55 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 72.41 | ± | 1.42 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 60.30 | ± | 1.55 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 25.76 | ± | 1.24 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.43 | ± | 0.91 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 47.33 | ± | 2.89 |
Average: 45.71%
Average score: 61.26%
Elapsed time: 02:16:03
## Open LLM Leaderboard Evaluation Results
Detailed results can be found here
| Metric | Value |
|---|---|
| Avg. | 74.11 |
| AI2 Reasoning Challenge (25-Shot) | 72.10 |
| HellaSwag (10-Shot) | 88.44 |
| MMLU (5-Shot) | 65.45 |
| TruthfulQA (0-shot) | 76.75 |
| Winogrande (5-shot) | 82.72 |
| GSM8k (5-shot) | 59.21 |