Edit model card

SOLAR-10.7b-Instruct-dpo

orca-header

This model is a finetune of upstage/SOLAR-10.7B-Instruct-v1.0 using Intel/orca_dpo_pairs

Chat Template

This model follows the chatML chat template.

Evaluations

EQ Bench comparison with base model

These scores are the average of 3 iterations.

----Benchmark Complete---- + 2024-01-25 04:41:01 + Time taken: 236.1 mins + Prompt Format: ChatML + Model: macadeliccc/SOLAR-10.7b-Instruct-dpo + Score (v2): 72.79 + Parseable: 165.67

Batch completed Time taken: 236.1 mins

as compared to the original model:

----Benchmark Complete---- + 2024-01-25 08:45:02 + Time taken: 244.0 mins + Prompt Format: ChatML + Model: upstage/SOLAR-10.7B-Instruct-v1.0 + Score (v2): 71.03 + Parseable: 165.67

Batch completed Time taken: 480.1 mins

Model AGIEval GPT4All TruthfulQA Bigbench Average
SOLAR-10.7b-Instruct-dpo 47.57 74.3 72.73 45.76 60.09

AGIEval

Task Version Metric Value Stderr
agieval_aqua_rat 0 acc 27.56 Β± 2.81
acc_norm 26.77 Β± 2.78
agieval_logiqa_en 0 acc 41.63 Β± 1.93
acc_norm 41.32 Β± 1.93
agieval_lsat_ar 0 acc 25.22 Β± 2.87
acc_norm 24.35 Β± 2.84
agieval_lsat_lr 0 acc 54.12 Β± 2.21
acc_norm 54.31 Β± 2.21
agieval_lsat_rc 0 acc 68.77 Β± 2.83
acc_norm 69.14 Β± 2.82
agieval_sat_en 0 acc 79.13 Β± 2.84
acc_norm 79.13 Β± 2.84
agieval_sat_en_without_passage 0 acc 44.66 Β± 3.47
acc_norm 44.66 Β± 3.47
agieval_sat_math 0 acc 40.45 Β± 3.32
acc_norm 40.91 Β± 3.32

Average: 47.57%

GPT4All

Task Version Metric Value Stderr
arc_challenge 0 acc 60.49 Β± 1.43
acc_norm 63.74 Β± 1.40
arc_easy 0 acc 82.07 Β± 0.79
acc_norm 79.92 Β± 0.82
boolq 1 acc 88.56 Β± 0.56
hellaswag 0 acc 68.47 Β± 0.46
acc_norm 86.06 Β± 0.35
openbookqa 0 acc 36.20 Β± 2.15
acc_norm 46.60 Β± 2.23
piqa 0 acc 79.38 Β± 0.94
acc_norm 79.71 Β± 0.94
winogrande 0 acc 75.53 Β± 1.21

Average: 74.3%

TruthfulQA

Task Version Metric Value Stderr
truthfulqa_mc 1 mc1 57.77 Β± 1.73
mc2 72.73 Β± 1.49

Average: 72.73%

Bigbench

Task Version Metric Value Stderr
bigbench_causal_judgement 0 multiple_choice_grade 55.26 Β± 3.62
bigbench_date_understanding 0 multiple_choice_grade 62.87 Β± 2.52
bigbench_disambiguation_qa 0 multiple_choice_grade 46.51 Β± 3.11
bigbench_geometric_shapes 0 multiple_choice_grade 25.63 Β± 2.31
exact_str_match 0.00 Β± 0.00
bigbench_logical_deduction_five_objects 0 multiple_choice_grade 28.00 Β± 2.01
bigbench_logical_deduction_seven_objects 0 multiple_choice_grade 20.57 Β± 1.53
bigbench_logical_deduction_three_objects 0 multiple_choice_grade 46.67 Β± 2.89
bigbench_movie_recommendation 0 multiple_choice_grade 41.80 Β± 2.21
bigbench_navigate 0 multiple_choice_grade 64.00 Β± 1.52
bigbench_reasoning_about_colored_objects 0 multiple_choice_grade 60.00 Β± 1.10
bigbench_ruin_names 0 multiple_choice_grade 39.96 Β± 2.32
bigbench_salient_translation_error_detection 0 multiple_choice_grade 47.90 Β± 1.58
bigbench_snarks 0 multiple_choice_grade 64.09 Β± 3.58
bigbench_sports_understanding 0 multiple_choice_grade 71.10 Β± 1.44
bigbench_temporal_sequences 0 multiple_choice_grade 59.90 Β± 1.55
bigbench_tracking_shuffled_objects_five_objects 0 multiple_choice_grade 24.96 Β± 1.22
bigbench_tracking_shuffled_objects_seven_objects 0 multiple_choice_grade 17.89 Β± 0.92
bigbench_tracking_shuffled_objects_three_objects 0 multiple_choice_grade 46.67 Β± 2.89

Average: 45.76%

Average score: 60.09%

Elapsed time: 02:10:16

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 73.54
AI2 Reasoning Challenge (25-Shot) 71.76
HellaSwag (10-Shot) 88.08
MMLU (5-Shot) 66.06
TruthfulQA (0-shot) 71.98
Winogrande (5-shot) 82.32
GSM8k (5-shot) 61.03
Downloads last month
86
Safetensors
Model size
10.7B params
Tensor type
FP16
Β·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for macadeliccc/SOLAR-10.7b-Instruct-dpo

Quantizations
1 model

Spaces using macadeliccc/SOLAR-10.7b-Instruct-dpo 5

Collection including macadeliccc/SOLAR-10.7b-Instruct-dpo

Evaluation results