killawhale2 committed
Commit 0dfb32e · Parent(s): 12acc01
Update README.md

README.md CHANGED
@@ -56,6 +56,26 @@ Using the datasets mentioned above, we applied SFT and iterative DPO training, a
 
 [2] Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J.T., Li, Z., Weller, A. and Liu, W., 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
 
+# **Data Contamination Test Results**
+
+Recently, there have been contamination issues in some models on the LLM leaderboard.
+We note that we made every effort to exclude any benchmark-related datasets from training.
+We also ensured the integrity of our model by conducting a data contamination test [3] that is also used by the HuggingFace team [4, 5].
+
+Our results, with every `result < 0.1, %:` value well below 0.9, indicate that our model is free from contamination.
+
+*The data contamination test results of HellaSwag and Winogrande will be added once [3] supports them.*
+
+| Model                        | ARC   | MMLU  | TruthfulQA | GSM8K |
+|------------------------------|-------|-------|------------|-------|
+| **SOLAR-10.7B-Instruct-v1.0**| result < 0.1, %: 0.06 | result < 0.1, %: 0.15 | result < 0.1, %: 0.28 | result < 0.1, %: 0.70 |
+
+[3] https://github.com/swj0419/detect-pretrain-code-contamination
+
+[4] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#657f2245365456e362412a06
+
+[5] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/265#657b6debf81f6b44b8966230
+
 # **Evaluation Results**
 
 | Model | H6 | Model Size |
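For context, the detector in [3] is, to our understanding, built around a Min-K%-Prob-style membership score: score a candidate benchmark sample with the model under test and average the log-likelihoods of its least-likely tokens. The sketch below illustrates only that scoring idea; the function name, parameters, and example numbers are our own illustrative assumptions, not the repo's actual API.

```python
import numpy as np

def min_k_percent_prob(token_logprobs, k=0.1):
    """Min-K%-Prob-style score: mean log-likelihood of the k% least-likely
    tokens in a candidate text, as scored by the model under test.

    Text seen during training tends to contain fewer very-low-probability
    tokens, so a higher (less negative) score hints at contamination.
    """
    lps = np.sort(np.asarray(token_logprobs, dtype=float))  # ascending order
    n = max(1, int(len(lps) * k))  # number of lowest-probability tokens kept
    return float(lps[:n].mean())

# Illustrative per-token log-probs for one benchmark sample (made-up numbers).
score = min_k_percent_prob([-0.2, -1.5, -0.1, -6.3, -0.4, -2.8], k=0.5)
```

In [3], per-sample scores like this are aggregated over a benchmark into a single statistic; our reading of the `result < 0.1, %` columns above is that values near 1 flag likely contamination, which is why the table reports them against the 0.9 threshold. See [3] for the exact definition.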