	---
	license: cc
	tags:
	- mergekit
	- merge
	base_model:
	- macadeliccc/MBX-7B-v3-DPO
	- mlabonne/OmniBeagle-7B
	model-index:
	- name: OmniCorso-7B
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: AI2 Reasoning Challenge (25-Shot)
	      type: ai2_arc
	      config: ARC-Challenge
	      split: test
	      args:
	        num_few_shot: 25
	    metrics:
	    - type: acc_norm
	      value: 72.7
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: HellaSwag (10-Shot)
	      type: hellaswag
	      split: validation
	      args:
	        num_few_shot: 10
	    metrics:
	    - type: acc_norm
	      value: 88.7
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU (5-Shot)
	      type: cais/mmlu
	      config: all
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 64.91
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: TruthfulQA (0-shot)
	      type: truthful_qa
	      config: multiple_choice
	      split: validation
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: mc2
	      value: 73.43
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: Winogrande (5-shot)
	      type: winogrande
	      config: winogrande_xl
	      split: validation
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 83.74
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GSM8k (5-shot)
	      type: gsm8k
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 70.96
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
	      name: Open LLM Leaderboard
	---
	# OmniCorso-7B

	![image/webp](https://cdn-uploads.huggingface.co/production/uploads/6455cc8d679315e4ef16fbec/PaG7ByWy1qnh_tcSuh35U.webp)

	## Code Example

	```python
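	from transformers import AutoModelForCausalLM, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("macadeliccc/OmniCorso-7B")
	model = AutoModelForCausalLM.from_pretrained("macadeliccc/OmniCorso-7B")

	messages = [
	    {"role": "system", "content": "Respond to the user's request like a pirate"},
	    {"role": "user", "content": "Can you write me a quicksort algorithm?"},
	]
	# Render the conversation with the tokenizer's chat template and append the assistant prompt
	gen_input = tokenizer.apply_chat_template(
	    messages, add_generation_prompt=True, return_tensors="pt"
	)

	# Generate a reply and decode only the newly generated tokens
	output = model.generate(gen_input, max_new_tokens=256)
	print(tokenizer.decode(output[0][gen_input.shape[-1]:], skip_special_tokens=True))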
	```

	The following models were included in the merge:
	* [macadeliccc/MBX-7B-v3-DPO](https://huggingface.co/macadeliccc/MBX-7B-v3-DPO)
	* [mlabonne/OmniBeagle-7B](https://huggingface.co/mlabonne/OmniBeagle-7B)

	### Configuration

	The following YAML configuration was used to produce this model:

	```yaml
	slices:
	  - sources:
	      - model: mlabonne/OmniBeagle-7B
	        layer_range: [0, 32]
	      - model: macadeliccc/MBX-7B-v3-DPO
	        layer_range: [0, 32]
	merge_method: slerp
	base_model: macadeliccc/MBX-7B-v3-DPO
	parameters:
	  t:
	    - filter: self_attn
	      value: [0, 0.5, 0.3, 0.7, 1]
	    - filter: mlp
	      value: [1, 0.5, 0.7, 0.3, 0]
	    - value: 0.5
	dtype: bfloat16
	```
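
To make the `slerp` setting above more concrete, here is a minimal pure-Python sketch of spherical linear interpolation between two weight vectors. It is an illustration only, not mergekit's implementation. The helper `gradient_value` is hypothetical and assumes that a list such as `[0, 0.5, 0.3, 0.7, 1]` is treated as evenly spaced anchor points that are linearly interpolated across the 32 layers; check the mergekit documentation for the exact behavior.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two equal-length vectors.

    t = 0 returns v0 and t = 1 returns v1.
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    if norm0 < eps or norm1 < eps:
        # A zero vector has no direction, so fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    # Angle between the two vectors, computed from their normalized dot product.
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_theta = max(-1.0, min(1.0, cos_theta))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if sin_theta < eps:
        # Nearly parallel vectors: linear interpolation is numerically safer.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    w0 = math.sin((1 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def gradient_value(anchors, layer, num_layers):
    """Hypothetical helper: piecewise-linear value of `t` for a given layer.

    Assumes the anchors are spread evenly from the first to the last layer.
    """
    if num_layers == 1:
        return float(anchors[0])
    pos = layer / (num_layers - 1) * (len(anchors) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(anchors) - 1)
    frac = pos - lo
    return (1 - frac) * anchors[lo] + frac * anchors[hi]


# Per-layer interpolation factor for the self_attn filter in the config above.
self_attn_t = [gradient_value([0, 0.5, 0.3, 0.7, 1], i, 32) for i in range(32)]
print([round(t, 2) for t in self_attn_t])
```

Under these assumptions, the `self_attn` and `mlp` schedules mirror each other, so each layer's attention and MLP weights lean toward opposite parent models, while all remaining tensors use the constant `t = 0.5`.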

	## Quantizations

	### GGUF

	+ [iMatrix](https://huggingface.co/macadeliccc/OmniCorso-7B-GGUF)

	### Exllamav2

	Quants are available thanks to user bartowski. Check them out [here](https://huggingface.co/bartowski/OmniCorso-7B-exl2).

	| Branch | Bits | lm_head bits | VRAM (4k) | VRAM (16k) | VRAM (32k) | Description |
	| ----- | ---- | ------- | ------ | ------ | ------ | ------------ |
	| [8_0](https://huggingface.co/bartowski/OmniCorso-7B-exl2/tree/8_0) | 8.0 | 8.0 | 8.4 GB | 9.8 GB | 11.8 GB | Maximum quality that ExLlamaV2 can produce, near unquantized performance. |
	| [6_5](https://huggingface.co/bartowski/OmniCorso-7B-exl2/tree/6_5) | 6.5 | 8.0 | 7.2 GB | 8.6 GB | 10.6 GB | Very similar to 8.0, good tradeoff of size vs performance, recommended. |
	| [5_0](https://huggingface.co/bartowski/OmniCorso-7B-exl2/tree/5_0) | 5.0 | 6.0 | 6.0 GB | 7.4 GB | 9.4 GB | Slightly lower quality vs 6.5, but usable on 8GB cards. |
	| [4_25](https://huggingface.co/bartowski/OmniCorso-7B-exl2/tree/4_25) | 4.25 | 6.0 | 5.3 GB | 6.7 GB | 8.7 GB | GPTQ equivalent bits per weight, slightly higher quality. |
	| [3_5](https://huggingface.co/bartowski/OmniCorso-7B-exl2/tree/3_5) | 3.5 | 6.0 | 4.7 GB | 6.1 GB | 8.1 GB | Lower quality, only use if you have to. |
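
As a rough aid for choosing a branch from the table above, the sketch below picks the highest-bit quant whose listed VRAM figure fits a given budget. The function name `pick_branch` and the `headroom_gb` parameter are hypothetical. The numbers are copied from the table; actual memory use on a given system may differ.

```python
# (branch, bits per weight, VRAM in GB keyed by context length in thousands of tokens),
# ordered from highest to lowest quality, as in the table above.
QUANTS = [
    ("8_0", 8.0, {4: 8.4, 16: 9.8, 32: 11.8}),
    ("6_5", 6.5, {4: 7.2, 16: 8.6, 32: 10.6}),
    ("5_0", 5.0, {4: 6.0, 16: 7.4, 32: 9.4}),
    ("4_25", 4.25, {4: 5.3, 16: 6.7, 32: 8.7}),
    ("3_5", 3.5, {4: 4.7, 16: 6.1, 32: 8.1}),
]


def pick_branch(vram_gb, context_k=4, headroom_gb=0.0):
    """Return the highest-quality branch that fits, or None if none does.

    headroom_gb reserves memory for anything else running on the GPU.
    """
    for branch, _bits, vram in QUANTS:
        if vram[context_k] + headroom_gb <= vram_gb:
            return branch
    return None


print(pick_branch(8.0, context_k=4, headroom_gb=1.0))
print(pick_branch(12.0, context_k=32))
```

Leaving some headroom is a conservative choice, since the table's figures cover the model and context only.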


	## Evaluations

	<pre>----Benchmark Complete----
	2024-02-11 15:34:40
	Time taken: 178.3 mins
	Prompt Format: ChatML
	Model: macadeliccc/OmniCorso-7B
	Score (v2): 73.75
	Parseable: 167.0
	---------------
	Batch completed
	Time taken: 178.3 mins
	---------------
	</pre>

	| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
	|---------------------------------------------------------------|------:|------:|---------:|-------:|------:|
	|[OmniCorso-7B](https://huggingface.co/macadeliccc/OmniCorso-7B)| 45.89| 77.66| 74.12| 49.24| 61.73|

	### AGIEval
	| Task |Version| Metric |Value| |Stderr|
	|------------------------------|------:|--------|----:|---|-----:|
	|agieval_aqua_rat | 0|acc |29.13|± | 2.86|
	| | |acc_norm|27.17|± | 2.80|
	|agieval_logiqa_en | 0|acc |39.32|± | 1.92|
	| | |acc_norm|39.63|± | 1.92|
	|agieval_lsat_ar | 0|acc |23.91|± | 2.82|
	| | |acc_norm|23.91|± | 2.82|
	|agieval_lsat_lr | 0|acc |53.14|± | 2.21|
	| | |acc_norm|53.92|± | 2.21|
	|agieval_lsat_rc | 0|acc |66.54|± | 2.88|
	| | |acc_norm|67.29|± | 2.87|
	|agieval_sat_en | 0|acc |80.58|± | 2.76|
	| | |acc_norm|80.58|± | 2.76|
	|agieval_sat_en_without_passage| 0|acc |45.63|± | 3.48|
	| | |acc_norm|43.69|± | 3.46|
	|agieval_sat_math | 0|acc |33.18|± | 3.18|
	| | |acc_norm|30.91|± | 3.12|

	Average: 45.89%

	### GPT4All
	| Task |Version| Metric |Value| |Stderr|
	|-------------|------:|--------|----:|---|-----:|
	|arc_challenge| 0|acc |67.32|± | 1.37|
	| | |acc_norm|68.43|± | 1.36|
	|arc_easy | 0|acc |87.46|± | 0.68|
	| | |acc_norm|83.50|± | 0.76|
	|boolq | 1|acc |88.13|± | 0.57|
	|hellaswag | 0|acc |68.47|± | 0.46|
	| | |acc_norm|86.96|± | 0.34|
	|openbookqa | 0|acc |38.80|± | 2.18|
	| | |acc_norm|50.00|± | 2.24|
	|piqa | 0|acc |83.03|± | 0.88|
	| | |acc_norm|85.31|± | 0.83|
	|winogrande | 0|acc |81.29|± | 1.10|

	Average: 77.66%

	### TruthfulQA
	| Task |Version|Metric|Value| |Stderr|
	|-------------|------:|------|----:|---|-----:|
	|truthfulqa_mc| 1|mc1 |58.26|± | 1.73|
	| | |mc2 |74.12|± | 1.43|

	Average: 74.12%

	### Bigbench
	| Task |Version| Metric |Value| |Stderr|
	|------------------------------------------------|------:|---------------------|----:|---|-----:|
	|bigbench_causal_judgement | 0|multiple_choice_grade|56.84|± | 3.60|
	|bigbench_date_understanding | 0|multiple_choice_grade|63.41|± | 2.51|
	|bigbench_disambiguation_qa | 0|multiple_choice_grade|49.22|± | 3.12|
	|bigbench_geometric_shapes | 0|multiple_choice_grade|23.96|± | 2.26|
	| | |exact_str_match | 1.39|± | 0.62|
	|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|34.20|± | 2.12|
	|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|23.71|± | 1.61|
	|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|60.33|± | 2.83|
	|bigbench_movie_recommendation | 0|multiple_choice_grade|49.00|± | 2.24|
	|bigbench_navigate | 0|multiple_choice_grade|55.20|± | 1.57|
	|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|70.75|± | 1.02|
	|bigbench_ruin_names | 0|multiple_choice_grade|55.80|± | 2.35|
	|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|36.97|± | 1.53|
	|bigbench_snarks | 0|multiple_choice_grade|72.38|± | 3.33|
	|bigbench_sports_understanding | 0|multiple_choice_grade|76.27|± | 1.36|
	|bigbench_temporal_sequences | 0|multiple_choice_grade|54.50|± | 1.58|
	|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|23.12|± | 1.19|
	|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|20.34|± | 0.96|
	|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|60.33|± | 2.83|

	Average: 49.24%

	Average score: 61.73%

	Elapsed time: 02:20:06
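
The reported averages can be reproduced from the tables above. The sketch below assumes each suite score is the mean of one metric per task, taking `acc_norm` where it is reported and `acc` otherwise; that rule is inferred from the numbers, not stated by the benchmark tool. The helper name `suite_score` is hypothetical.

```python
def suite_score(tasks):
    """Mean over tasks, preferring acc_norm and falling back to acc (assumed rule)."""
    picked = [task.get("acc_norm", task.get("acc")) for task in tasks]
    return sum(picked) / len(picked)


# Values copied from the GPT4All table above.
gpt4all = [
    {"acc": 67.32, "acc_norm": 68.43},  # arc_challenge
    {"acc": 87.46, "acc_norm": 83.50},  # arc_easy
    {"acc": 88.13},                     # boolq
    {"acc": 68.47, "acc_norm": 86.96},  # hellaswag
    {"acc": 38.80, "acc_norm": 50.00},  # openbookqa
    {"acc": 83.03, "acc_norm": 85.31},  # piqa
    {"acc": 81.29},                     # winogrande
]
gpt4all_avg = round(suite_score(gpt4all), 2)

# Overall score: unweighted mean of the four suite averages.
suite_averages = {"AGIEval": 45.89, "GPT4All": 77.66, "TruthfulQA": 74.12, "Bigbench": 49.24}
overall = round(sum(suite_averages.values()) / len(suite_averages), 2)

print(gpt4all_avg, overall)
```

The TruthfulQA suite score matches the `mc2` value, and the Bigbench average matches the mean of the `multiple_choice_grade` rows.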
	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_macadeliccc__OmniCorso-7B)

	| Metric |Value|
	|---------------------------------|----:|
	|Avg. |75.74|
	|AI2 Reasoning Challenge (25-Shot)|72.70|
	|HellaSwag (10-Shot) |88.70|
	|MMLU (5-Shot) |64.91|
	|TruthfulQA (0-shot) |73.43|
	|Winogrande (5-shot) |83.74|
	|GSM8k (5-shot) |70.96|