[FLAG] Voicelab/trurl-2-13b: training data surely includes the test data, right?
There's no way that trurl-2-13b, a 13B model, beats the best 70B models on MMLU by FAR. Its absurdly high score suggests the model has seen the MMLU test data during training.
agree
They do disclose it in their Model Card:
Training data
The training data includes Q&A pairs from various sources:
- Alpaca comparison data with GPT
- Falcon comparison data
- Dolly 15k
- Oasst1
- Phu saferlfhf
- ShareGPT version 2023.05.08v0, filtered and cleaned
- Voicelab private datasets for JSON data extraction, modification, and analysis
- CURLICAT dataset containing journal entries
- Dataset from Polish wiki with Q&A pairs grouped into conversations
- MMLU data in textual format
- Voicelab private datasets with sales conversations, arguments and objections, paraphrases, contact reason detection, and corrected dialogues
Should probably add a column with a dataset contamination warning... Nobody can rationally judge this to be the best 13B model going simply by the leaderboard average. @clefourrier
The interesting thing is that on ARC it gets 60.07, which ranks 37th among 13B models. The median is around 57.94, and the max is held by Orca Mini at 63.14.
On HellaSwag it gets 80.23, which ranks 144th, horribly bad among 13B models. In fact the median is 81.23, so it performed below the median. The max is held by beaugogh/Llama2-13b-sharegpt4 at 84.53.
MMLU is an extreme outlier at 78.59, dramatically surpassing the best other 13B model, OpenOrca Platypus, which got 59.39. Highly abnormal, and yes, as @felixz mentioned, the model card does disclose the test data contamination.
Maybe add a column to detect outliers for each parameter size, i.e. do a groupby, then flag anything above mean + 3*std, which for MMLU would have been 73.94, yet this model got 78.59. For skewed distributions, maybe a median + dispersion based approach would be better (see the sketch below).
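A minimal sketch of what that check could look like in pandas. The dataframe layout (a hypothetical `params` size-bucket column plus one score column per benchmark) and the MAD multiplier are assumptions, not the leaderboard's actual schema:

```python
import pandas as pd

# Hypothetical leaderboard frame: one row per model, with an assumed
# "params" size-bucket column and one score column per benchmark.
BENCHMARKS = ["ARC", "HellaSwag", "MMLU", "TruthfulQA"]

def flag_outliers(df: pd.DataFrame, k_std: float = 3.0, k_mad: float = 5.0) -> pd.DataFrame:
    out = df.copy()
    grouped = out.groupby("params")
    for bench in BENCHMARKS:
        # Mean/std rule: scores above mean + 3*std within a size
        # bucket are suspiciously good.
        mean = grouped[bench].transform("mean")
        std = grouped[bench].transform("std")
        out[f"{bench}_outlier"] = out[bench] > mean + k_std * std

        # Robust alternative for skewed distributions: median plus a
        # multiple of the median absolute deviation (MAD).
        median = grouped[bench].transform("median")
        mad = grouped[bench].transform(lambda s: (s - s.median()).abs().median())
        out[f"{bench}_outlier_robust"] = out[bench] > median + k_mad * mad
    return out

# Example: list 13B models whose MMLU score trips the mean + 3*std rule.
# flagged = flag_outliers(leaderboard)
# print(flagged.loc[(flagged["params"] == "13B") & flagged["MMLU_outlier"], "model"])
```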
Any model that is trained on the test sets should be removed from the leaderboard manually. An automated process to detect it doesn't seem necessary unless it becomes more common.
Hi! We introduced a flagging system to make it more obvious to users which models' results can't be reliably trusted! Thank you all for your interest in this issue!
FLAG: This model has been flagged because it was trained on test data: MMLU.