dataset,prompt,metric,value
amazon_reviews_multi_en,prompt_body_title_to_star,accuracy,0.4774
amazon_reviews_multi_en,prompt_review_to_star,accuracy,0.4246
amazon_reviews_multi_en,prompt_title_to_star,accuracy,0.3018
amazon_reviews_multi_en,median,accuracy,0.4246
amazon_reviews_multi_es,prompt_body_title_to_star,accuracy,0.4024
amazon_reviews_multi_es,prompt_review_to_star,accuracy,0.3892
amazon_reviews_multi_es,prompt_title_to_star,accuracy,0.252
amazon_reviews_multi_es,median,accuracy,0.3892
amazon_reviews_multi_fr,prompt_body_title_to_star,accuracy,0.4004
amazon_reviews_multi_fr,prompt_review_to_star,accuracy,0.3838
amazon_reviews_multi_fr,prompt_title_to_star,accuracy,0.2736
amazon_reviews_multi_fr,median,accuracy,0.3838
amazon_reviews_multi_zh,prompt_body_title_to_star,accuracy,0.3524
amazon_reviews_multi_zh,prompt_review_to_star,accuracy,0.3444
amazon_reviews_multi_zh,prompt_title_to_star,accuracy,0.2596
amazon_reviews_multi_zh,median,accuracy,0.3444
aqua_rat_raw,Answer questions from options,accuracy,0.2283464566929134
aqua_rat_raw,answer_quiz,accuracy,0.24803149606299213
aqua_rat_raw,select_the_best_option,accuracy,0.20078740157480315
aqua_rat_raw,median,accuracy,0.2283464566929134
art_None,choose_hypothesis,accuracy,0.5039164490861618
art_None,choose_hypothesis_believable,accuracy,0.4934725848563969
art_None,choose_hypothesis_desc,accuracy,0.4993472584856397
art_None,choose_hypothesis_likely,accuracy,0.5006527415143603
art_None,choose_hypothesis_options,accuracy,0.5065274151436031
art_None,median,accuracy,0.5006527415143603
banking77_None,direct_to_which_department,accuracy,0.14805194805194805
banking77_None,help_page_topic,accuracy,0.19285714285714287
banking77_None,rephrase_as_banking_term,accuracy,0.20454545454545456
banking77_None,median,accuracy,0.19285714285714287
blbooksgenre_title_genre_classifiction,classify,accuracy,0.25172811059907835
blbooksgenre_title_genre_classifiction,multi-choice,accuracy,0.25057603686635943
blbooksgenre_title_genre_classifiction,premise_context_first,accuracy,0.7494239631336406
blbooksgenre_title_genre_classifiction,median,accuracy,0.25172811059907835
blimp_adjunct_island,grammatical_between_1_2,accuracy,0.609
blimp_adjunct_island,grammatical_between_A_B,accuracy,0.512
blimp_adjunct_island,grammatical_which_one_1_2,accuracy,0.571
blimp_adjunct_island,single_sentence_bad_yes_no,accuracy,0.523
blimp_adjunct_island,single_sentence_good_yes_no,accuracy,0.5
blimp_adjunct_island,median,accuracy,0.523
climate_fever_None,claim_and_all_supporting_evidences,accuracy,0.30944625407166126
climate_fever_None,fifth_evidence_and_claim_itemization,accuracy,0.10358306188925082
climate_fever_None,first_evidence_and_claim_itemization,accuracy,0.09771986970684039
climate_fever_None,second_evidence_and_claim_itemization,accuracy,0.10618892508143322
climate_fever_None,third_evidence_claim_pair,accuracy,0.0996742671009772
climate_fever_None,median,accuracy,0.10358306188925082
codah_codah,affirmative_instruction_after_sentence_and_choices,accuracy,0.24603746397694526
codah_codah,affirmative_instruction_before_sentence_and_choices,accuracy,0.23919308357348704
codah_codah,interrogative_instruction_after_sentence_and_choices,accuracy,0.24351585014409222
codah_codah,median,accuracy,0.24351585014409222
commonsense_qa_None,answer_given_question_without_options,accuracy,0.42424242424242425
commonsense_qa_None,most_suitable_answer,accuracy,0.38493038493038495
commonsense_qa_None,question_answering,accuracy,0.389025389025389
commonsense_qa_None,median,accuracy,0.389025389025389
conv_ai_3_None,ambiguous,accuracy,0.39040207522697795
conv_ai_3_None,clarification_needed,accuracy,0.39040207522697795
conv_ai_3_None,directly_answer,accuracy,0.6095979247730221
conv_ai_3_None,score_give_number,accuracy,0.021184608733246867
conv_ai_3_None,score_how_much,accuracy,0.03156074362300043
conv_ai_3_None,median,accuracy,0.39040207522697795
craigslist_bargains_None,best deal,accuracy,0.440536013400335
craigslist_bargains_None,good deal for seller,accuracy,0.5192629815745393
craigslist_bargains_None,good deal for seller no list price,accuracy,0.7236180904522613
craigslist_bargains_None,good deal for seller no list price implicit,accuracy,0.23785594639865998
craigslist_bargains_None,median,accuracy,0.47989949748743715
emotion_None,answer_question_with_emotion_label,accuracy,0.324
emotion_None,answer_with_class_label,accuracy,0.22
emotion_None,choose_the_best_emotion_label,accuracy,0.3875
emotion_None,reply_with_emoation_label,accuracy,0.424
emotion_None,median,accuracy,0.35575
financial_phrasebank_sentences_allagree,bullish_neutral_bearish,accuracy,0.31051236749116606
financial_phrasebank_sentences_allagree,complementary_industries,accuracy,0.14885159010600707
financial_phrasebank_sentences_allagree,sentiment,accuracy,0.3480565371024735
financial_phrasebank_sentences_allagree,share_price_option,accuracy,0.2928445229681979
financial_phrasebank_sentences_allagree,word_comes_to_mind,accuracy,0.21024734982332155
financial_phrasebank_sentences_allagree,median,accuracy,0.2928445229681979
glue_cola,Following sentence acceptable,accuracy,0.6797698945349953
glue_cola,Make sense yes no,accuracy,0.5781399808245445
glue_cola,Previous sentence acceptable,accuracy,0.5992329817833174
glue_cola,editing,accuracy,0.49185043144774687
glue_cola,is_this_correct,accuracy,0.3077660594439118
glue_cola,median,accuracy,0.5781399808245445
glue_sst2,following positive negative,accuracy,0.8314220183486238
glue_sst2,happy or mad,accuracy,0.8371559633027523
glue_sst2,positive negative after,accuracy,0.8096330275229358
glue_sst2,review,accuracy,0.8979357798165137
glue_sst2,said,accuracy,0.8853211009174312
glue_sst2,median,accuracy,0.8371559633027523
head_qa_en,multiple_choice_a_and_q_en,accuracy,0.2759882869692533
head_qa_en,multiple_choice_a_and_q_with_context_en,accuracy,0.27159590043923865
head_qa_en,multiple_choice_q_and_a_en,accuracy,0.2767203513909224
head_qa_en,multiple_choice_q_and_a_index_en,accuracy,0.2591508052708638
head_qa_en,multiple_choice_q_and_a_index_with_context_en,accuracy,0.24377745241581258
head_qa_en,median,accuracy,0.27159590043923865
head_qa_es,multiple_choice_a_and_q_en,accuracy,0.25036603221083453
head_qa_es,multiple_choice_a_and_q_with_context_en,accuracy,0.24597364568081992
head_qa_es,multiple_choice_q_and_a_en,accuracy,0.26281112737920936
head_qa_es,multiple_choice_q_and_a_index_en,accuracy,0.25109809663250365
head_qa_es,multiple_choice_q_and_a_index_with_context_en,accuracy,0.24743777452415813
head_qa_es,median,accuracy,0.25036603221083453
health_fact_None,claim_explanation_classification,accuracy,0.5681632653061225
health_fact_None,claim_veracity_classification_after_reading_I_believe,accuracy,0.1510204081632653
health_fact_None,claim_veracity_classification_tell_me,accuracy,0.4563265306122449
health_fact_None,median,accuracy,0.4563265306122449
hlgd_None,is_same_event_editor_asks,accuracy,0.6578057032382794
hlgd_None,is_same_event_interrogative_talk,accuracy,0.6428226196230062
hlgd_None,is_same_event_refer,accuracy,0.7249879168680522
hlgd_None,is_same_event_with_time_interrogative_related,accuracy,0.6863218946350894
hlgd_None,is_same_event_with_time_interrogative_talk,accuracy,0.6863218946350894
hlgd_None,median,accuracy,0.6863218946350894
hyperpartisan_news_detection_byarticle,consider_does_it_follow_a_hyperpartisan_argumentation,accuracy,0.6310077519379845
hyperpartisan_news_detection_byarticle,consider_it_exhibits_extreme_one_sidedness,accuracy,0.6310077519379845
hyperpartisan_news_detection_byarticle,consume_with_caution,accuracy,0.6325581395348837
hyperpartisan_news_detection_byarticle,extreme_left_wing_or_right_wing,accuracy,0.5829457364341085
hyperpartisan_news_detection_byarticle,follows_hyperpartisan_argumentation,accuracy,0.6201550387596899
hyperpartisan_news_detection_byarticle,median,accuracy,0.6310077519379845
liar_None,Given statement guess category,accuracy,0.19314641744548286
liar_None,median,accuracy,0.19314641744548286
lince_sa_spaeng,express sentiment,accuracy,0.5535233996772458
lince_sa_spaeng,negation template,accuracy,0.16460462614308768
lince_sa_spaeng,original poster expressed sentiment,accuracy,0.5438407746100054
lince_sa_spaeng,sentiment trying to express,accuracy,0.5384615384615384
lince_sa_spaeng,the author seem,accuracy,0.5368477676169984
lince_sa_spaeng,median,accuracy,0.5384615384615384
math_qa_None,choose_correct_og,accuracy,0.21608040201005024
math_qa_None,first_choice_then_problem,accuracy,0.19631490787269681
math_qa_None,gre_problem,accuracy,0.20971524288107202
math_qa_None,pick_the_correct,accuracy,0.21206030150753769
math_qa_None,problem_set_type,accuracy,0.2793969849246231
math_qa_None,median,accuracy,0.21206030150753769
mlsum_es,layman_summ_es,bleu,0.027120352148581234
mlsum_es,palm_prompt,bleu,0.028540407645253642
mlsum_es,summarise_this_in_es_few_sentences,bleu,0.02865384931959682
mlsum_es,median,bleu,0.028540407645253642
movie_rationales_None,Evidences + review,accuracy,0.975
movie_rationales_None,Evidences sentiment classification,accuracy,0.975
movie_rationales_None,Standard binary sentiment analysis,accuracy,0.875
movie_rationales_None,median,accuracy,0.975
mwsc_None,in-the-sentence,accuracy,0.573170731707317
mwsc_None,in-the-sentence-question-first,accuracy,0.524390243902439
mwsc_None,is-correct,accuracy,0.5121951219512195
mwsc_None,options-or,accuracy,0.5365853658536586
mwsc_None,what-think,accuracy,0.5121951219512195
mwsc_None,median,accuracy,0.524390243902439
onestop_english_None,ara_context,accuracy,0.3403880070546737
onestop_english_None,assess,accuracy,0.3350970017636684
onestop_english_None,determine_reading_level_from_the_first_three_sentences,accuracy,0.345679012345679
onestop_english_None,esl_context,accuracy,0.328042328042328
onestop_english_None,esl_variation,accuracy,0.32275132275132273
onestop_english_None,median,accuracy,0.3350970017636684
poem_sentiment_None,guess_sentiment_without_options_variation_1,accuracy,0.2571428571428571
poem_sentiment_None,most_appropriate_sentiment,accuracy,0.3238095238095238
poem_sentiment_None,positive_or_negative_sentiment_variation_1,accuracy,0.2571428571428571
poem_sentiment_None,positive_or_negative_sentiment_variation_2,accuracy,0.29523809523809524
poem_sentiment_None,question_answer_format,accuracy,0.3333333333333333
poem_sentiment_None,median,accuracy,0.29523809523809524
pubmed_qa_pqa_labeled,Long Answer to Final Decision,accuracy,0.635
pubmed_qa_pqa_labeled,Question Answering (Short),accuracy,0.543
pubmed_qa_pqa_labeled,median,accuracy,0.589
riddle_sense_None,answer_given_question_without_options,accuracy,0.39862879529872675
riddle_sense_None,most_suitable_answer,accuracy,0.24877571008814886
riddle_sense_None,question_answering,accuracy,0.24191968658178256
riddle_sense_None,question_to_answer_index,accuracy,0.1929480901077375
riddle_sense_None,median,accuracy,0.24534769833496572
scicite_None,Classify intent,accuracy,0.17903930131004367
scicite_None,Classify intent (choices first),accuracy,0.15065502183406113
scicite_None,Classify intent (select choice),accuracy,0.15611353711790393
scicite_None,Classify intent w/section (select choice),accuracy,0.21724890829694324
scicite_None,can_describe,accuracy,0.259825327510917
scicite_None,median,accuracy,0.17903930131004367
selqa_answer_selection_analysis,is-he-talking-about,accuracy,0.8866242038216561
selqa_answer_selection_analysis,make-sense-rand,accuracy,0.8738853503184714
selqa_answer_selection_analysis,which-answer-1st-vs-random,accuracy,0.5171974522292994
selqa_answer_selection_analysis,would-make-sense-qu-rand,accuracy,0.8802547770700637
selqa_answer_selection_analysis,median,accuracy,0.8770700636942675
snips_built_in_intents_None,categorize_query,accuracy,0.3597560975609756
snips_built_in_intents_None,categorize_query_brief,accuracy,0.5396341463414634
snips_built_in_intents_None,intent_query,accuracy,0.1676829268292683
snips_built_in_intents_None,query_intent,accuracy,0.4603658536585366
snips_built_in_intents_None,voice_intent,accuracy,0.39634146341463417
snips_built_in_intents_None,median,accuracy,0.39634146341463417
wmt14_fr_en_en-fr,a_good_translation-en-fr-source+target,bleu,0.023458085666003575
wmt14_fr_en_en-fr,a_good_translation-en-fr-target,bleu,0.021816396665585328
wmt14_fr_en_en-fr,gpt3-en-fr,bleu,0.00025997559795041045
wmt14_fr_en_en-fr,version-en-fr-target,bleu,0.02227540952829272
wmt14_fr_en_en-fr,xglm-en-fr-target,bleu,0.06208490978871131
wmt14_fr_en_en-fr,median,bleu,0.02227540952829272
wmt14_fr_en_fr-en,a_good_translation-fr-en-source+target,bleu,0.25380925876631605
wmt14_fr_en_fr-en,a_good_translation-fr-en-target,bleu,0.16622960261108521
wmt14_fr_en_fr-en,gpt3-fr-en,bleu,0.008644457350404656
wmt14_fr_en_fr-en,version-fr-en-target,bleu,0.1606615213941368
wmt14_fr_en_fr-en,xglm-fr-en-target,bleu,0.1760990614881427
wmt14_fr_en_fr-en,median,bleu,0.16622960261108521
wmt14_hi_en_en-hi,a_good_translation-en-hi-source+target,bleu,0.004949179557007594
wmt14_hi_en_en-hi,a_good_translation-en-hi-target,bleu,0.0026204531297437587
wmt14_hi_en_en-hi,gpt-3-en-hi-target,bleu,1.0125186357154061e-26
wmt14_hi_en_en-hi,version-en-hi-target,bleu,0.002859269911787752
wmt14_hi_en_en-hi,xglm-en-hi-target,bleu,0.000291688799662918
wmt14_hi_en_en-hi,median,bleu,0.0026204531297437587
wmt14_hi_en_hi-en,a_good_translation-hi-en-source+target,bleu,0.110920558209937
wmt14_hi_en_hi-en,a_good_translation-hi-en-target,bleu,0.07969255346070397
wmt14_hi_en_hi-en,gpt-3-hi-en-target,bleu,1.815212789755798e-57
wmt14_hi_en_hi-en,version-hi-en-target,bleu,0.10179796760814867
wmt14_hi_en_hi-en,xglm-hi-en-target,bleu,0.08420322693635622
wmt14_hi_en_hi-en,median,bleu,0.08420322693635622
multiple,average,multiple,0.37171450318227334