Update on Leaderboard Results - Bug in Macro F1 Score Calculation
The team at @maritaca-ai helped me identify a bug that affected some models in tasks where the F1 score is the evaluation metric. The issue occurs when a model generates an invalid response on a task whose F1 score is computed with a macro average. Previously, the code incorrectly included the placeholder "[invalid]" as an additional class in the score averaging process. Because that extra class always has an F1 of zero, it significantly lowered the scores for models on tasks where they produced any invalid responses.
The evaluation code has been updated to exclude the "[invalid]" tag, and the leaderboard results have been revised to reflect the correct values. As a result, some models may have changed in rank and overall average score.
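To make the effect concrete, here is a minimal sketch of the averaging problem. It is not the leaderboard's evaluation code: the helper functions, the label names "yes"/"no", and the toy predictions are all assumptions made up for illustration. It only shows how counting "[invalid]" as a class pulls a macro average down.

```python
# Illustrative sketch only: hand-rolled macro F1 on made-up data.
# The function names and the "yes"/"no" labels are hypothetical.

def f1_for_class(y_true, y_pred, label):
    """F1 score of a single class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores over `labels`."""
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)

y_true = ["yes", "no", "yes", "no", "yes", "no"]
# One response could not be parsed and was replaced by the placeholder.
y_pred = ["yes", "no", "yes", "no", "[invalid]", "no"]

# Old behaviour (as described above): every label seen in the predictions
# is averaged, so "[invalid]" becomes a third class with F1 = 0.
buggy_labels = sorted(set(y_true) | set(y_pred))
buggy = macro_f1(y_true, y_pred, buggy_labels)

# Fixed behaviour: average only over the real task classes.
valid_labels = sorted(set(y_true))
fixed = macro_f1(y_true, y_pred, valid_labels)

print(f"with [invalid] as a class: {buggy:.2f}")     # 0.60
print(f"without [invalid] as a class: {fixed:.2f}")  # 0.90
```

In this sketch the invalid response is still penalized after the fix, since it counts as a missed "yes"; it just no longer adds a zero-scoring class to the average.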
Below is the list of affected models. Models not on this list are not affected.
Model Name Precision Old_score->New_score
152334H/miqu-1-70b-sf float16 71.51->73.50
abhishek/autotrain-llama3-orpo-v2 bfloat16 10.62->13.68
ai-forever/mGPT-13B float16 9.61->12.19
allenai/tulu-2-dpo-70b bfloat16 72.13->74.13
allknowingroger/MultiverseEx26-7B-slerp bfloat16 67.70->69.52
argilla/CapybaraHermes-2.5-Mistral-7B float16 65.08->66.67
automerger/YamshadowExperiment28-7B bfloat16 67.79->69.62
argilla/notux-8x7b-v1 bfloat16 67.69->73.10
axolotl-ai-co/romulus-mistral-nemo-12b-simpo bfloat16 69.32->71.97
baichuan-inc/Baichuan2-13B-Chat bfloat16 43.66->46.84
BAAI/Infinity-Instruct-3M-0625-Mistral-7B bfloat16 69.01->70.91
BAAI/Infinity-Instruct-3M-0613-Mistral-7B float16 68.25->70.10
bardsai/jaskier-7b-dpo-v5.6 bfloat16 67.58->69.41
berkeley-nest/Starling-LM-7B-alpha bfloat16 67.90->69.59
bardsai/jaskier-7b-dpo-v5.6 float16 67.76->69.65
chujiezheng/Llama-3-Instruct-8B-SimPO-ExPO bfloat16 65.82->67.71
cognitivecomputations/dolphin-2.9.3-mistral-7B-32k bfloat16 65.03->66.84
cognitivecomputations/openchat-3.5-0106-laser bfloat16 68.35->70.18
cognitivecomputations/WestLake-7B-v2-laser bfloat16 67.00->68.80
cognitivecomputations/laserxtral bfloat16 67.52->69.32
Columbia-NLP/LION-LLaMA-3-8b-odpo-v1.0 bfloat16 58.43->68.93
Danielbrdz/Barcenas-14b-Phi-3-medium-ORPO float16 70.17->72.03
CohereForAI/c4ai-command-r-v01 float16 66.49->68.28
Danielbrdz/Barcenas-Llama3-8b-ORPO float16 68.25->70.10
CultriX/NeuralMona_MoE-4x7B bfloat16 67.32->69.11
DeepMount00/Llama-3-8b-Ita bfloat16 68.78->70.65
dominguesm/mambarim-110m float16 14.16->18.01
eduagarcia/gemma-7b-it_no_chat_template bfloat16 55.17->57.28
dzakwan/dzakwan-MoE-4x7b-Beta float16 53.52->55.83
eldogbbhed/Peagle-9b float16 51.89->53.35
EleutherAI/pythia-14m float16 18.90->22.62
EleutherAI/pythia-70m-deduped float16 19.37->25.59
EleutherAI/pythia-70m float16 22.73->23.18
failspy/Meta-Llama-3-8B-Instruct-abliterated-v3 bfloat16 68.82->70.65
failspy/Phi-3-medium-4k-instruct-abliterated-v3 bfloat16 68.92->70.66
freewheelin/free-solar-evo-v0.1 float16 43.61->51.79
freewheelin/free-solar-evo-v0.11 float16 44.30->52.72
freewheelin/free-solar-evo-v0.13 float16 46.17->55.48
FuseAI/FuseChat-7B-VaRM bfloat16 67.49->69.12
ghost-x/ghost-8b-beta bfloat16 61.66->63.66
ghost-x/ghost-8b-beta-1608 bfloat16 60.98->62.97
google/mt5-base bfloat16 8.87->10.16
google/mt5-small bfloat16 0.81->0.81
GritLM/GritLM-7B-KTO bfloat16 65.04->66.71
google/mt5-base float16 8.89->10.25
grimjim/Llama-3-Instruct-8B-SPPO-Iter3-SimPO-merge bfloat16 68.04->69.80
GritLM/GritLM-7B bfloat16 65.84->67.52
HuggingFaceH4/zephyr-7b-beta bfloat16 62.77->64.47
HuggingFaceTB/SmolLM-1.7B-Instruct bfloat16 17.65->23.10
hkust-nlp/deita-7b-v1.0 bfloat16 64.48->66.32
HuggingFaceTB/SmolLM-135M-Instruct bfloat16 13.00->16.02
HuggingFaceTB/SmolLM-360M-Instruct bfloat16 17.69->21.05
ibivibiv/llama-3-nectar-dpo-8B bfloat16 68.37->70.19
Intel/neural-chat-7b-v3-3 float16 65.34->67.07
ibivibiv/multimaster-7b-v6 bfloat16 67.35->69.20
Intel/neural-chat-7b-v3-1 float16 67.27->69.17
internlm/internlm2-chat-20b float16 64.59->67.58
internlm/internlm2_5-1_8b bfloat16 36.04->37.67
internlm/internlm2-chat-20b-sft float16 59.35->64.76
internlm/internlm2_5-20b-chat bfloat16 49.84->56.94
invalid-coder/Sakura-SOLAR-Instruct-CarbonVillain-en-10.7B-v2-slerp float16 69.40->71.37
jeonsworld/CarbonVillain-en-10.7B-v4 bfloat16 69.35->71.31
JJhooww/Mistral_Relora_Step2k float16 64.42->66.34
jsfs11/MixtureofMerges-MoE-4x7b-v5 bfloat16 67.71->69.53
JJhooww/Mistral_Relora_Step2k bfloat16 64.22->66.13
jpacifico/Chocolatine-14B-Instruct-4k-DPO float16 69.85->71.69
jsfs11/MixtureofMerges-MoE-4x7b-v4 bfloat16 67.63->69.44
Kquant03/CognitiveFusion2-4x7B-BF16 bfloat16 67.64->69.47
kekmodel/StopCarbon-10.7B-v5 float16 69.62->71.61
Kukedlc/NeuralExperiment-7b-MagicCoder-v7.5 float16 67.37->69.21
Kukedlc/NeuralSynthesis-7B-v0.1 bfloat16 67.76->69.57
Kukedlc/NeuralSynthesis-7b-v0.4-slerp bfloat16 67.71->69.54
LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct bfloat16 61.84->64.41
Kukedlc/NeuralSynthesis-7B-v0.3 bfloat16 67.69->69.52
liminerity/M7-7b bfloat16 67.72->69.55
lrds-code/boana-7b-instruct bfloat16 44.57->46.07
M4-ai/tau-0.5B float16 26.38->27.61
lucianosb/boto-27B bfloat16 27.47->35.78
lrds-code/samba-1.1B bfloat16 16.89->20.51
M4-ai/tau-1.8B bfloat16 31.82->36.40
Magpie-Align/Llama-3-8B-Magpie-Align-v0.3 bfloat16 51.26->63.60
matheusrdgsf/cesar-ptbr GPTQ 59.22->64.04
maywell/Synatra-7B-v0.3-RP float16 57.67->60.98
MaziyarPanahi/Llama-3-8B-Instruct-v0.8 bfloat16 68.85->70.72
MaziyarPanahi/Llama-3-8B-Instruct-v0.10 bfloat16 68.77->70.63
MaziyarPanahi/Mistral-7B-Instruct-v0.3 bfloat16 66.30->68.06
MaziyarPanahi/Calme-4x7B-MoE-v0.2 bfloat16 49.00->50.77
MaziyarPanahi/Mistral-7B-Instruct-Aya-101 bfloat16 64.63->66.49
MaziyarPanahi/Calme-4x7B-MoE-v0.1 bfloat16 49.12->50.94
MaziyarPanahi/Llama-3-8B-Instruct-v0.9 bfloat16 68.86->70.71
MaziyarPanahi/Topxtral-4x7B-v0.1 bfloat16 67.48->69.28
meraGPT/mera-mix-4x7B bfloat16 67.72->69.52
meta-llama/Llama-2-7b-chat-hf bfloat16 42.36->52.20
microsoft/phi-1_5 float16 28.41->29.64
microsoft/Phi-3-medium-4k-instruct bfloat16 70.42->72.26
mistralai/Mixtral-8x7B-Instruct-v0.1 bfloat16 69.71->73.14
mistralai/Mistral-7B-Instruct-v0.2 bfloat16 64.81->66.68
mistralai/Mistral-7B-Instruct-v0.3 bfloat16 66.30->68.06
mlabonne/AlphaMonarch-7B float16 50.16->53.62
mlabonne/Beyonder-4x7B-v3 float16 53.47->55.79
mlabonne/Monarch-7B bfloat16 67.01->68.80
mlabonne/Llama-3-8B-Instruct-abliterated-dpomix float16 68.53->70.35
MulaBR/Mula-4x160-v0.1 float16 26.24->26.66
mlabonne/NeuralMonarch-7B float16 50.30->53.78
MulaBR/Mula-8x160-v0.1 float16 25.72->27.65
Nexusflow/Starling-LM-7B-beta bfloat16 69.03->70.90
nicholasKluge/TeenyTinyLlama-160m bfloat16 28.20->28.62
MTSAIR/multi_verse_model bfloat16 48.69->53.95
NLPark/AnFeng_v3_Avocet bfloat16 16.63->23.20
NousResearch/Nous-Hermes-2-Mistral-7B-DPO bfloat16 61.73->66.75
NOVA-vision-language/GlorIA-1.3B float16 4.10->5.44
OliveiraJLT/Sagui-7B-Instruct-v0.1 bfloat16 39.87->41.56
openchat/openchat-3.5-0106 bfloat16 68.69->70.55
openai-community/openai-gpt float16 1.58->1.96
paulml/OGNO-7B bfloat16 67.63->69.45
princeton-nlp/Llama-3-Instruct-8B-SimPO bfloat16 66.43->68.31
princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2 bfloat16 55.06->68.41
Qwen/Qwen-72B-Chat bfloat16 30.80->33.70
Qwen/Qwen1.5-0.5B bfloat16 25.74->28.75
Qwen/Qwen-1_8B-Chat bfloat16 37.65->39.27
Qwen/Qwen1.5-1.8B bfloat16 30.14->32.66
Qwen/Qwen1.5-110B-Chat bfloat16 72.74->74.67
Qwen/Qwen-1_8B-Chat float16 36.70->38.32
Qwen/Qwen1.5-32B bfloat16 62.88->64.32
Qwen/Qwen1.5-110B-Chat 4bit 72.51->74.41
Qwen/Qwen2-0.5B bfloat16 27.58->30.14
Ramikan-BR/tinyllama-coder-py-4bit-v10 float16 27.68->29.62
recogna-nlp/bode-7b-alpaca-pt-br float16 53.21->54.82
recogna-nlp/mistralbode_7b_qlora_ultraalpaca float16 63.57->65.35
rhaymison/Mistral-8x7b-portuguese-luana float16 66.05->71.33
rhaymison/gemma-portuguese-tom-cat-2b-it float16 31.76->36.70
rhaymison/gemma-portuguese-2b-it bfloat16 4.23->6.35
rhaymison/Mistral-portuguese-luana-7b-Mathematics float16 63.60->65.41
rishiraj/CatPPT-base float16 67.92->69.65
RLHFlow/LLaMA3-iterative-DPO-final bfloat16 61.53->68.95
rishiraj/CatPPT bfloat16 68.06->69.80
RubielLabarta/LogoS-7Bx2-MoE-13B-v0.2 bfloat16 67.55->69.37
royallab/ZephRP-m7b bfloat16 63.25->64.98
rombodawg/Everyone-Coder-4x7b-Base float16 64.12->65.78
saltlux/luxia-21.4b-alignment-v1.2 bfloat16 66.25->68.09
shadowml/BeagSake-7B bfloat16 52.74->56.79
rhaymison/Mistral-portuguese-luana-7b-Mathematics bfloat16 63.45->65.26
SeaLLMs/SeaLLM-7B-v2 bfloat16 66.49->68.15
saltlux/luxia-21.4b-alignment-v1.0 bfloat16 67.27->69.10
ssmits/Falcon2-5.5B-Portuguese bfloat16 0.56->0.73
ssmits/Falcon2-5.5B-multilingual bfloat16 0.56->0.73
state-spaces/mamba-1.4b-hf float16 27.72->29.62
saltlux/luxia-21.4b-alignment-v1.0 float16 67.23->69.06
THUDM/chatglm3-6b float16 50.44->56.00
teknium/OpenHermes-2-Mistral-7B bfloat16 63.76->65.51
teknium/OpenHermes-2.5-Mistral-7B bfloat16 64.84->66.47
TheBloke/zephyr-7B-beta-GPTQ GPTQ 59.22->64.04
UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3 bfloat16 67.37->69.14
UCLA-AGI/Mistral7B-PairRM-SPPO-Iter2 bfloat16 58.67->65.47
UCLA-AGI/Mistral7B-PairRM-SPPO-Iter3 bfloat16 55.11->65.24
UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter2 bfloat16 67.62->69.40
TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T float16 32.28->34.30
upstage/SOLAR-10.7B-Instruct-v1.0 float16 69.47->71.44
VAGOsolutions/SauerkrautLM-Nemo-12b-Instruct bfloat16 71.63->73.64
VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct bfloat16 68.78->70.65
uygarkurt/llama-3-merged-linear float16 68.73->70.59
vicgalle/CarbonBeagle-11B-truthy float16 70.46->72.42
vicgalle/ConfigurableSOLAR-10.7B float16 68.93->70.92
Walmart-the-bag/Misted-v2-7B float16 66.00->67.89
vicgalle/ConfigurableBeagle-11B float16 70.57->72.54
Walmart-the-bag/Quintellect-10.7B float16 65.28->67.13
vicgalle/CarbonBeagle-11B float16 69.64->71.57
Weni/WeniGPT-2.4.1-Zephyr-7B-3-epochs-GPT-QA-1.0.1_DP_DPO float16 61.64->63.40
Weni/ZeroShot-3.4.22-Mistral-7b-DPO-1.0.0 float16 63.11->64.84
Weni/ZeroShot-3.3.34-Mistral-7b-Multilanguage-3.3.0-merged float16 63.05->64.77
Weni/WeniGPT-Mistral-7B-instructBase float16 39.55->44.21
Weni/WeniGPT-Mistral-7B-instructBase-4bit float16 42.14->47.44
yunconglong/Truthful_DPO_TomGrc_FusionNet_7Bx2_MoE_13B bfloat16 66.99->68.84
yunconglong/DARE_TIES_13B bfloat16 66.88->68.73
xverse/XVERSE-65B bfloat16 53.71->55.45
yunconglong/MoE_13B_DPO bfloat16 66.95->68.79
xverse/XVERSE-13B bfloat16 52.59->54.40
zhengr/MixTAO-7Bx2-MoE-v8.1 bfloat16 67.44->69.26
Details on which scores of each model were affected by the change can be seen in this commit: https://huggingface.co/datasets/eduagarcia-temp/llm_pt_leaderboard_requests/commit/25143f35bbad78968196e31313b68744896d6d1c