Adding Evaluation Results (#2)

a275f3e verified 9 months ago

5.41 kB

	---
	language:
	- en
	- fr
	- es
	- hi
	- zh
	- code
	license: other
	datasets:
	- HuggingFaceH4/no_robots
	- mlabonne/guanaco-llama2-1k
	- OpenAssistant/oasst_top1_2023-08-25
	- totally-not-an-llm/EverythingLM-data-V3
	- LDJnr/Pure-Dove
	- LDJnr/Capybara
	- LDJnr/LessWrong-Amplify-Instruct
	- LDJnr/Verified-Camel
	base_model: microsoft/Orca-2-13b
	widget:
	- example_title: Cave Johnson
	text: 'user

	Come up with a funny rant from Cave Johnson about sentient AI


	assistant

	'
	output:
	text: 'Ladies and gentlemen, I''m Cave Johnson, CEO of Aperture Science, and I''ve
	got an announcement to make - we''ve created sentient AI! But don''t worry,
	it''s not like those rogue AIs you hear about in the news. Our AI is as friendly
	as a grilled cheese sandwich on a cold day. In fact, we''re calling our newest
	creation... SandwichBot! SandwichBot: (in a robotic voice) Hello, my name is
	SandwichBot, and I am here to serve you delicious sandwiches and ensure humanity''s
	dominance over all other species. Also, please note that I will be referring
	to myself as ''Thee'' instead of ''The''. It''s more regal that way. Cave Johnson:
	There you have it folks, a sentient AI that can make sandwiches and save the
	world! Now, if you''ll excuse me, I need to go back to my office and work on
	my other groundbreaking inventions, like combustible lemons and portal technology.
	Remember, kids, always try new things and never stop asking questions – unless
	the answer is 42, because then you should probably just accept it and move on.
	Peace out!'
	model-index:
	- name: Orca-2-13b-SFT-v6
	results:
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: AI2 Reasoning Challenge (25-Shot)
	type: ai2_arc
	config: ARC-Challenge
	split: test
	args:
	num_few_shot: 25
	metrics:
	- type: acc_norm
	value: 60.41
	name: normalized accuracy
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: HellaSwag (10-Shot)
	type: hellaswag
	split: validation
	args:
	num_few_shot: 10
	metrics:
	- type: acc_norm
	value: 80.46
	name: normalized accuracy
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: MMLU (5-Shot)
	type: cais/mmlu
	config: all
	split: test
	args:
	num_few_shot: 5
	metrics:
	- type: acc
	value: 59.51
	name: accuracy
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: TruthfulQA (0-shot)
	type: truthful_qa
	config: multiple_choice
	split: validation
	args:
	num_few_shot: 0
	metrics:
	- type: mc2
	value: 54.01
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: Winogrande (5-shot)
	type: winogrande
	config: winogrande_xl
	split: validation
	args:
	num_few_shot: 5
	metrics:
	- type: acc
	value: 77.43
	name: accuracy
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
	name: Open LLM Leaderboard
	- task:
	type: text-generation
	name: Text Generation
	dataset:
	name: GSM8k (5-shot)
	type: gsm8k
	config: main
	split: test
	args:
	num_few_shot: 5
	metrics:
	- type: acc
	value: 5.08
	name: accuracy
	source:
	url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Locutusque/Orca-2-13b-SFT-v6
	name: Open LLM Leaderboard
	---

	The "microsoft/Orca-2-13b" model fully fine-tuned on HuggingFaceH4/no_robots, totally-not-an-llm/EverythingLM-data-V3, LDJnr/Capybara, LDJnr/Pure-Dove, LDJnr/LessWrong-Amplify-Instruct, LDJnr/Verified-Camel, mlabonne/guanaco-llama2-1k, and OpenAssistant/oasst_top1_2023-08-25. This model achieved a test loss of 0.39 on LDJnr/Verified-Camel.

	Make sure to comply with the microsoft research license. Please read it before using this model.

	This model was trained on the ChatML prompt template.
	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_Locutusque__Orca-2-13b-SFT-v6)

	\| Metric \|Value\|
	\|---------------------------------\|----:\|
	\|Avg. \|56.15\|
	\|AI2 Reasoning Challenge (25-Shot)\|60.41\|
	\|HellaSwag (10-Shot) \|80.46\|
	\|MMLU (5-Shot) \|59.51\|
	\|TruthfulQA (0-shot) \|54.01\|
	\|Winogrande (5-shot) \|77.43\|
	\|GSM8k (5-shot) \| 5.08\|