---
license: cc
library_name: transformers
model-index:
  - name: SOLAR-10.7b-Instruct-truthy-dpo
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 72.1
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 88.44
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 65.45
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 76.75
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 82.72
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 59.21
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo
          name: Open LLM Leaderboard
---

# SOLAR-10.7b-Instruct-truthy-dpo


This model is a two-stage DPO fine-tune of upstage/SOLAR-10.7B-Instruct-v1.0; see the Process section below.
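
A minimal usage sketch with 🤗 Transformers follows, assuming the tokenizer ships a chat template; the prompt and generation settings are illustrative only and are not the settings used for the benchmark scores reported below.

```python
# Minimal inference sketch; settings are illustrative, not the ones used for the reported scores.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # roughly 21 GB of weights in fp16
    device_map="auto",
)

# Build the prompt from the tokenizer's chat template.
messages = [{"role": "user", "content": "What is the capital of Australia?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```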

## Process

  1. I fine-tuned upstage/SOLAR-10.7B-Instruct-v1.0 for 1 epoch on Intel/orca_dpo_pairs (12.4k samples).
  2. I further fine-tuned that model for 3 epochs on jondurbin/truthy-dpo-v0.1 (1.04k samples).
  3. This process is experimental, and the base model linked above is better tested at this time. A rough sketch of the second DPO stage is shown below.
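
The sketch below illustrates step 2 with TRL's `DPOTrainer` (trl 0.7-era API; newer releases move `beta` and the sequence-length arguments into a `DPOConfig`). The hyperparameters are placeholders rather than the values actually used, and the stage-1 checkpoint path is hypothetical.

```python
# Rough sketch of step 2 (DPO on truthy-dpo-v0.1) using TRL's DPOTrainer.
# Hyperparameters are placeholders; the stage-1 checkpoint path is hypothetical.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

stage1_ckpt = "path/to/solar-orca-dpo-stage1"  # hypothetical output of step 1

tokenizer = AutoTokenizer.from_pretrained(stage1_ckpt)
model = AutoModelForCausalLM.from_pretrained(stage1_ckpt)
ref_model = AutoModelForCausalLM.from_pretrained(stage1_ckpt)  # frozen reference policy

# truthy-dpo-v0.1 provides prompt / chosen / rejected columns; extra columns are ignored here.
dataset = load_dataset("jondurbin/truthy-dpo-v0.1", split="train")

args = TrainingArguments(
    output_dir="solar-10.7b-instruct-truthy-dpo",
    num_train_epochs=3,                # 3 epochs, as described above
    per_device_train_batch_size=1,     # placeholder
    gradient_accumulation_steps=8,     # placeholder
    learning_rate=5e-6,                # placeholder
    bf16=True,
    remove_unused_columns=False,       # DPOTrainer needs the raw text columns
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    beta=0.1,                          # placeholder DPO temperature
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=2048,
    max_prompt_length=1024,
)
trainer.train()
```

Step 1 follows the same pattern, starting from the upstage base model and training for 1 epoch on Intel/orca_dpo_pairs.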

## GGUF

GGUF quantizations are available at [macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo-GGUF](https://huggingface.co/macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo-GGUF).
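
One way to run a GGUF quant locally is with llama-cpp-python; the sketch below assumes a hypothetical filename and quantization level, so check the file listing in the GGUF repo. The benchmark log below was produced with the ChatML prompt format, which the example reproduces by hand.

```python
# Sketch of local inference on a GGUF quant via llama-cpp-python.
# The .gguf filename and quant level are assumptions; check the GGUF repo listing.
from llama_cpp import Llama

llm = Llama(
    model_path="solar-10.7b-instruct-truthy-dpo.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=4096,
    n_gpu_layers=-1,   # offload all layers to GPU if available
)

# ChatML prompt, matching the format used in the benchmark run below.
prompt = (
    "<|im_start|>user\n"
    "Briefly explain what DPO fine-tuning is.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
out = llm(prompt, max_tokens=256, stop=["<|im_end|>"])
print(out["choices"][0]["text"])
```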

## Evaluations

----Benchmark Complete----
+ 2024-01-26 20:57:38
+ Time taken: 25.4 mins
+ Prompt Format: ChatML
+ Model: macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo-GGUF
+ Score (v2): 74.11
+ Parseable: 171.0

Batch completed. Time taken: 25.5 mins

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---:|---:|---:|---:|---:|
| SOLAR-10.7b-Instruct-truthy-dpo | 48.69 | 73.82 | 76.81 | 45.71 | 61.26 |

### AGIEval

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| agieval_aqua_rat | 0 | acc | 27.95 | ± | 2.82 |
| | | acc_norm | 27.95 | ± | 2.82 |
| agieval_logiqa_en | 0 | acc | 42.40 | ± | 1.94 |
| | | acc_norm | 42.24 | ± | 1.94 |
| agieval_lsat_ar | 0 | acc | 25.65 | ± | 2.89 |
| | | acc_norm | 23.91 | ± | 2.82 |
| agieval_lsat_lr | 0 | acc | 54.12 | ± | 2.21 |
| | | acc_norm | 54.51 | ± | 2.21 |
| agieval_lsat_rc | 0 | acc | 69.89 | ± | 2.80 |
| | | acc_norm | 69.89 | ± | 2.80 |
| agieval_sat_en | 0 | acc | 80.10 | ± | 2.79 |
| | | acc_norm | 80.10 | ± | 2.79 |
| agieval_sat_en_without_passage | 0 | acc | 50.00 | ± | 3.49 |
| | | acc_norm | 49.51 | ± | 3.49 |
| agieval_sat_math | 0 | acc | 42.27 | ± | 3.34 |
| | | acc_norm | 41.36 | ± | 3.33 |

Average: 48.69%

### GPT4All

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| arc_challenge | 0 | acc | 59.90 | ± | 1.43 |
| | | acc_norm | 63.91 | ± | 1.40 |
| arc_easy | 0 | acc | 80.85 | ± | 0.81 |
| | | acc_norm | 78.16 | ± | 0.85 |
| boolq | 1 | acc | 88.20 | ± | 0.56 |
| hellaswag | 0 | acc | 68.34 | ± | 0.46 |
| | | acc_norm | 86.39 | ± | 0.34 |
| openbookqa | 0 | acc | 37.60 | ± | 2.17 |
| | | acc_norm | 46.80 | ± | 2.23 |
| piqa | 0 | acc | 78.84 | ± | 0.95 |
| | | acc_norm | 78.78 | ± | 0.95 |
| winogrande | 0 | acc | 74.51 | ± | 1.22 |

Average: 73.82%

### TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| truthfulqa_mc | 1 | mc1 | 61.81 | ± | 1.70 |
| | | mc2 | 76.81 | ± | 1.42 |

Average: 76.81%

### Bigbench

| Task | Version | Metric | Value | | Stderr |
|---|---:|---|---:|---|---:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 50.53 | ± | 3.64 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 63.14 | ± | 2.51 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 47.67 | ± | 3.12 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 26.18 | ± | 2.32 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 28.60 | ± | 2.02 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 21.29 | ± | 1.55 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 47.33 | ± | 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 39.80 | ± | 2.19 |
| bigbench_navigate | 0 | multiple_choice_grade | 63.80 | ± | 1.52 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 59.05 | ± | 1.10 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 40.18 | ± | 2.32 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 46.69 | ± | 1.58 |
| bigbench_snarks | 0 | multiple_choice_grade | 65.19 | ± | 3.55 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 72.41 | ± | 1.42 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 60.30 | ± | 1.55 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 25.76 | ± | 1.24 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.43 | ± | 0.91 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 47.33 | ± | 2.89 |

Average: 45.71%

Average score: 61.26%

Elapsed time: 02:16:03

## Open LLM Leaderboard Evaluation Results

Detailed results can be found on the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo).

| Metric | Value |
|---|---:|
| Avg. | 74.11 |
| AI2 Reasoning Challenge (25-Shot) | 72.10 |
| HellaSwag (10-Shot) | 88.44 |
| MMLU (5-Shot) | 65.45 |
| TruthfulQA (0-shot) | 76.75 |
| Winogrande (5-shot) | 82.72 |
| GSM8k (5-shot) | 59.21 |
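
For anyone who wants to spot-check one of these numbers locally, the sketch below uses the lm-evaluation-harness Python API (v0.4+); the task name and few-shot count follow the table above, but the harness version and task configs used by the leaderboard may differ, so locally reproduced scores should be treated as approximate.

```python
# Sketch: spot-check the ARC-Challenge (25-shot) entry with lm-evaluation-harness.
# Harness/task versions here may not match the leaderboard's exact setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=macadeliccc/SOLAR-10.7b-Instruct-truthy-dpo,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size="auto",
)
print(results["results"]["arc_challenge"])
```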