leaderboard-pr-bot committed on
Commit
27448dc
1 Parent(s): 452d74c

Adding Evaluation Results


This is an automated PR created with https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr

The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.

If you encounter any issues, please report them to https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr/discussions
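For reference, once a PR like this is merged the scores live in the card's `model-index` YAML front matter and can be read back programmatically. A minimal sketch using `huggingface_hub` (an illustration, not part of this PR; it assumes the merged card keeps the `model-index` block shown in the diff below):

```python
# Sketch: read back the evaluation results this PR adds to the model card metadata.
# Requires `pip install huggingface_hub`; assumes the PR has been merged.
from huggingface_hub import ModelCard

card = ModelCard.load("google/flan-ul2")   # fetches README.md from the Hub
metadata = card.data.to_dict()             # parsed YAML front matter as a dict

for entry in metadata.get("model-index", []):
    for result in entry.get("results", []):
        dataset = result["dataset"]["name"]
        for metric in result.get("metrics", []):
            print(f'{dataset}: {metric["name"]} = {metric["value"]}')
```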

Files changed (1)
  1. README.md +158 -58
README.md CHANGED
@@ -1,64 +1,151 @@
  ---
  language:
- - en
- - fr
- - ro
- - de
- - multilingual
- widget:
- - text: 'Translate to German: My name is Arthur'
-   example_title: Translation
- - text: >-
-     Please answer to the following question. Who is going to be the next
-     Ballon d'or?
-   example_title: Question Answering
- - text: >-
-     Q: Can Geoffrey Hinton have a conversation with George Washington? Give
-     the rationale before answering.
-   example_title: Logical reasoning
- - text: >-
-     Please answer the following question. What is the boiling point of
-     Nitrogen?
-   example_title: Scientific knowledge
- - text: >-
-     Answer the following yes/no question. Can you write a whole Haiku in a
-     single tweet?
-   example_title: Yes/no question
- - text: >-
-     Answer the following yes/no question by reasoning step-by-step. Can you
-     write a whole Haiku in a single tweet?
-   example_title: Reasoning task
- - text: 'Q: ( False or not False or False ) is? A: Let''s think step by step'
-   example_title: Boolean Expressions
- - text: >-
-     The square root of x is the cube root of y. What is y to the power of 2,
-     if x = 4?
-   example_title: Math reasoning
- - text: >-
-     Premise: At my age you will probably have learnt one lesson. Hypothesis:
-     It's not certain how many lessons you'll learn by your thirties. Does the
-     premise entail the hypothesis?
-   example_title: Premise and hypothesis
- - text: >-
-     Answer the following question by reasoning step by step.
-     The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apple do they have?
-   example_title: Chain of thought
+ - en
+ - fr
+ - ro
+ - de
+ - multilingual
+ license: apache-2.0
  tags:
- - text2text-generation
- - flan-ul2
+ - text2text-generation
+ - flan-ul2
  datasets:
- - svakulenk0/qrecc
- - taskmaster2
- - djaym7/wiki_dialog
- - deepmind/code_contests
- - lambada
- - gsm8k
- - aqua_rat
- - esnli
- - quasc
- - qed
- - c4
- license: apache-2.0
+ - svakulenk0/qrecc
+ - taskmaster2
+ - djaym7/wiki_dialog
+ - deepmind/code_contests
+ - lambada
+ - gsm8k
+ - aqua_rat
+ - esnli
+ - quasc
+ - qed
+ - c4
+ widget:
+ - text: 'Translate to German: My name is Arthur'
+   example_title: Translation
+ - text: Please answer to the following question. Who is going to be the next Ballon
+     d'or?
+   example_title: Question Answering
+ - text: 'Q: Can Geoffrey Hinton have a conversation with George Washington? Give the
+     rationale before answering.'
+   example_title: Logical reasoning
+ - text: Please answer the following question. What is the boiling point of Nitrogen?
+   example_title: Scientific knowledge
+ - text: Answer the following yes/no question. Can you write a whole Haiku in a single
+     tweet?
+   example_title: Yes/no question
+ - text: Answer the following yes/no question by reasoning step-by-step. Can you write
+     a whole Haiku in a single tweet?
+   example_title: Reasoning task
+ - text: 'Q: ( False or not False or False ) is? A: Let''s think step by step'
+   example_title: Boolean Expressions
+ - text: The square root of x is the cube root of y. What is y to the power of 2, if
+     x = 4?
+   example_title: Math reasoning
+ - text: 'Premise: At my age you will probably have learnt one lesson. Hypothesis: It''s
+     not certain how many lessons you''ll learn by your thirties. Does the premise
+     entail the hypothesis?'
+   example_title: Premise and hypothesis
+ - text: Answer the following question by reasoning step by step. The cafeteria had
+     23 apples. If they used 20 for lunch, and bought 6 more, how many apple do they
+     have?
+   example_title: Chain of thought
+ model-index:
+ - name: flan-ul2
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: IFEval (0-Shot)
+       type: HuggingFaceH4/ifeval
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: inst_level_strict_acc and prompt_level_strict_acc
+       value: 23.93
+       name: strict accuracy
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=google/flan-ul2
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: BBH (3-Shot)
+       type: BBH
+       args:
+         num_few_shot: 3
+     metrics:
+     - type: acc_norm
+       value: 30.02
+       name: normalized accuracy
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=google/flan-ul2
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MATH Lvl 5 (4-Shot)
+       type: hendrycks/competition_math
+       args:
+         num_few_shot: 4
+     metrics:
+     - type: exact_match
+       value: 0.15
+       name: exact match
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=google/flan-ul2
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: GPQA (0-shot)
+       type: Idavidrein/gpqa
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: acc_norm
+       value: 5.03
+       name: acc_norm
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=google/flan-ul2
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MuSR (0-shot)
+       type: TAUR-Lab/MuSR
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: acc_norm
+       value: 5.58
+       name: acc_norm
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=google/flan-ul2
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MMLU-PRO (5-shot)
+       type: TIGER-Lab/MMLU-Pro
+       config: main
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 16.59
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=google/flan-ul2
+       name: Open LLM Leaderboard
  ---


@@ -244,4 +331,17 @@ This model was originally contributed by [Yi Tay](https://www.yitay.net/?author=

  # Citation

- If you want to cite this work, please consider citing the [blogpost](https://www.yitay.net/blog/flan-ul2-20b) announcing the release of `Flan-UL2`.
+ If you want to cite this work, please consider citing the [blogpost](https://www.yitay.net/blog/flan-ul2-20b) announcing the release of `Flan-UL2`.
+ # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
+ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_google__flan-ul2)
+ 
+ |      Metric       |Value|
+ |-------------------|----:|
+ |Avg.               |13.55|
+ |IFEval (0-Shot)    |23.93|
+ |BBH (3-Shot)       |30.02|
+ |MATH Lvl 5 (4-Shot)| 0.15|
+ |GPQA (0-shot)      | 5.03|
+ |MuSR (0-shot)      | 5.58|
+ |MMLU-PRO (5-shot)  |16.59|
+
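
As a quick sanity check on the table added above: the Avg. row is the unweighted mean of the six benchmark scores. A minimal sketch in plain Python (illustration only, not part of the PR):

```python
# Unweighted mean of the six leaderboard scores reported in the added table.
scores = {
    "IFEval (0-Shot)": 23.93,
    "BBH (3-Shot)": 30.02,
    "MATH Lvl 5 (4-Shot)": 0.15,
    "GPQA (0-shot)": 5.03,
    "MuSR (0-shot)": 5.58,
    "MMLU-PRO (5-shot)": 16.59,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # 13.55, matching the Avg. row
```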