language:
  - en
library_name: peft
pipeline_tag: text-generation
tags:
  - Mistral
license: llama2
model-index:
  - name: SpeechlessCoder
    results:
      - task:
          type: text-generation
        dataset:
          type: openai_humaneval
          name: HumanEval
        metrics:
          - name: pass@1
            type: pass@1
            value: 0
            verified: false

Mistral-7b-OpenOrca-lora

This is a test.

This LoRA model was extracted from the parameter-efficient fine-tuned model Mistral-7B-OpenOrca. It still needs to be verified whether the LoRA model achieves performance comparable to the original model.

The final goal is to create a toolkit that can simultaneously load multiple LoRA modules, and automatically switch to the appropriate combination of LoRA modules based on user queries to generate the best answer.
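The routing idea can be illustrated with a small sketch. This is not the toolkit itself: the adapter names (`code`, `math`, `chat`), the keyword table, and the `pick_adapter` helper are all hypothetical, and the sketch picks a single adapter rather than a combination. A real router would more likely use a classifier or embedding similarity than keyword overlap.

```python
# Hypothetical keyword table: adapter name -> trigger words (illustration only).
ADAPTER_KEYWORDS = {
    "code": {"python", "function", "bug", "compile"},
    "math": {"solve", "equation", "integral", "prove"},
}
DEFAULT_ADAPTER = "chat"  # hypothetical fallback when no keyword matches


def pick_adapter(query: str) -> str:
    """Return the adapter whose keywords overlap the query the most."""
    words = set(query.lower().split())
    best_name, best_hits = DEFAULT_ADAPTER, 0
    for name, keywords in ADAPTER_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name


if __name__ == "__main__":
    for q in ["Fix the bug in this python function", "Hello there"]:
        print(q, "->", pick_adapter(q))
```

With PEFT, each adapter could be registered once via `PeftModel.load_adapter(path, adapter_name=name)`, and the name returned by `pick_adapter` passed to `PeftModel.set_adapter(name)` before generation.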

The LoRA-merged model is here

The source code is here

Mistral-7B-OpenOrca

Local Test

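| Model / LoRA rank | ARC_acc_norm (25-shot) | HellaSwag_acc_norm (10-shot) | MMLU_acc (5-shot) | TruthfulQA_mc2 (0-shot) | GSM8K_acc (8-shot) | Open LLM Score |
| --- | --- | --- | --- | --- | --- | --- |
| Mistral-7B-OpenOrca | 71 | 83 | 61.42 | 45 | 40 | 65.11 |
| r=256 | 68 | 84 | 64.28 | 46.953 | 41 | 65.81 |
| r=64 | 67 | 84 | 64.26 | 47.32 | 41 | 65.65 |
| r=16 | 65 | 83 | 62.84 | 46.95 | 38 | 64.45 |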
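The Open LLM Score column appears to be the unweighted mean of ARC, HellaSwag, MMLU, and TruthfulQA, with GSM8K left out. This is inferred from the numbers in the table, not stated by the author. The sketch below recomputes the score for each row so the reported values can be checked; the `open_llm_score` helper name is an assumption.

```python
from statistics import mean

# Rows copied from the Local Test table:
# (ARC, HellaSwag, MMLU, TruthfulQA, GSM8K, reported Open LLM Score)
LOCAL_TEST = {
    "Mistral-7B-OpenOrca": (71, 83, 61.42, 45, 40, 65.11),
    "r=256": (68, 84, 64.28, 46.953, 41, 65.81),
    "r=64": (67, 84, 64.26, 47.32, 41, 65.65),
    "r=16": (65, 83, 62.84, 46.95, 38, 64.45),
}


def open_llm_score(arc, hellaswag, mmlu, truthfulqa):
    """Unweighted mean of the four benchmarks (assumed formula; GSM8K excluded)."""
    return mean([arc, hellaswag, mmlu, truthfulqa])


if __name__ == "__main__":
    for name, (*four, _gsm8k, reported) in LOCAL_TEST.items():
        computed = open_llm_score(*four)
        print(f"{name}: computed {computed:.3f}, reported {reported}")
```

Each recomputed value agrees with the reported score to within rounding, which supports the four-benchmark reading of the column.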

Open LLM Leaderboard

| Model | ARC_acc_norm (25-shot) | HellaSwag_acc_norm (10-shot) | MMLU_acc (5-shot) | TruthfulQA_mc2 (0-shot) | Open LLM Score |
| --- | --- | --- | --- | --- | --- |
| Mistral-7B-SlimOrca | 62.54 | 83.86 | 62.77 | 54.23 | 65.85 |
| Mistral-7B-OpenOrca | 64.08 | 83.99 | 62.24 | 53.05 | 65.84 |

lm-evaluation-harness

Open LLM Leaderboard

| Metric | Mistral-7B-OpenOrca | Mistral-7B-OpenOrca-lora | Mistral-7B-OpenOrca-lora-merged |
| --- | --- | --- | --- |
| ARC | 64.08 | | |
| HellaSwag | 83.99 | | |
| MMLU | 62.24 | | |
| TruthfulQA | 53.05 | | |
| Average | 65.84 | | |

HumanEval

| Metric | Mistral-7B-OpenOrca | Mistral-7B-OpenOrca-lora | Mistral-7B-OpenOrca-lora-merged |
| --- | --- | --- | --- |
| humaneval-python | 35.976 | | |

Training procedure

The following bitsandbytes quantization config was used during training:

  • quant_method: bitsandbytes
  • load_in_8bit: False
  • load_in_4bit: True
  • llm_int8_threshold: 6.0
  • llm_int8_skip_modules: None
  • llm_int8_enable_fp32_cpu_offload: False
  • llm_int8_has_fp16_weight: False
  • bnb_4bit_quant_type: nf4
  • bnb_4bit_use_double_quant: True
  • bnb_4bit_compute_dtype: bfloat16
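For reference, the settings listed above could be expressed as a `BitsAndBytesConfig` from the `transformers` library roughly as follows. This is a sketch, not the author's training script: the `BNB_KWARGS` dictionary and the `build_bnb_config` helper are names chosen here for illustration, and `quant_method` is omitted because the library sets it internally rather than taking it as a constructor argument.

```python
# Values copied from the quantization config list; the dtype is kept as a
# string here and converted to a torch dtype only when the config is built.
BNB_KWARGS = {
    "load_in_8bit": False,
    "load_in_4bit": True,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": None,
    "llm_int8_enable_fp32_cpu_offload": False,
    "llm_int8_has_fp16_weight": False,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
}


def build_bnb_config():
    """Build a BitsAndBytesConfig (requires torch and transformers)."""
    import torch
    from transformers import BitsAndBytesConfig

    kwargs = dict(BNB_KWARGS)
    # Map the dtype name to the actual torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**kwargs)
```

The resulting object would typically be passed as `quantization_config=` to `AutoModelForCausalLM.from_pretrained` when loading the base model in 4-bit.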

Framework versions

  • PEFT 0.5.0

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
| --- | --- |
| Avg. | 50.72 |
| ARC (25-shot) | 61.95 |
| HellaSwag (10-shot) | 83.62 |
| MMLU (5-shot) | 64.16 |
| TruthfulQA (0-shot) | 42.74 |
| Winogrande (5-shot) | 79.08 |
| GSM8K (5-shot) | 17.29 |
| DROP (3-shot) | 6.19 |