likejazz committed
Commit 95a20e8 · verified · 1 Parent(s): befcfb7

Update README.md

Files changed (1)
  1. README.md +23 -11
README.md CHANGED
@@ -42,20 +42,32 @@ DNA 1.0 8B Instruct was fine-tuned on approximately 10B tokens of carefully cura
 
  We evaluated DNA 1.0 8B Instruct against other prominent language models of similar size across various benchmarks, including Korean-specific tasks and general language understanding metrics. More details will be provided in the upcoming <u>Technical Report</u>.
 
- | Language | Benchmark | **dnotitia/DNA-1.0-8B-Instruct** | LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct | yanolja/EEVE-Korean-Instruct-10.8B-v1.0 | meta-llama/Llama-3.1-8B-Instruct | mistralai/Mistral-7B-Instruct-v0.3 | NCSOFT/Llama-VARCO-8B-Instruct | upstage/SOLAR-10.7B-Instruct-v1.0 |
- |----------|------------|----------------------------------|--------------------------------------|-----------------------------------------|----------------------------------|------------------------------------|--------------------------------|-----------------------------------|
- | Korean | KMMLU | **53.26** (1st) | <u>45.28</u> | 42.17 | 41.66 | 31.45 | 38.49 | 41.50 |
- | | KMMLU-hard | **29.46** (1st) | 20.78 | 19.25 | 20.49 | 17.86 | 19.83 | 20.61 |
- | | KoBEST | **83.40** (1st) | 80.13 | <u>81.67</u> | 67.56 | 63.77 | 72.99 | 73.26 |
- | | Belebele | **57.99** (1st) | 45.11 | 49.40 | <u>54.70</u> | 40.31 | 53.17 | 48.68 |
- | | CSATQA | **43.32** (1st) | 34.76 | <u>39.57</u> | 36.90 | 27.27 | 32.62 | 34.22 |
- | English | MMLU | <u>66.59</u> (2nd) | 64.32 | 63.63 | **68.26** | 62.04 | 63.25 | 65.30 |
- | | MMLU-Pro | **43.05** (1st) | 38.90 | 32.79 | <u>40.92</u> | 33.49 | 37.11 | 30.25 |
- | | GSM8K | **80.52** (1st) | <u>80.06</u> | 56.18 | 75.82 | 49.66 | 64.14 | 69.22 |
+ | Language | Benchmark | **dnotitia/DNA-1.0-8B-Instruct** | LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct | LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct | yanolja/EEVE-Korean-Instruct-10.8B-v1.0 | Qwen/Qwen2.5-7B-Instruct | meta-llama/Llama-3.1-8B-Instruct | mistralai/Mistral-7B-Instruct-v0.3 | NCSOFT/Llama-VARCO-8B-Instruct | upstage/SOLAR-10.7B-Instruct-v1.0 |
+ |----------|------------|----------------------------------|--------------------------------------|--------------------------------------|-----------------------------------------|--------------------------|----------------------------------|------------------------------------|--------------------------------|-----------------------------------|
+ | Korean | KMMLU | **53.26** (1st) | 45.30 | 45.28 | 42.17 | 45.66 | 41.66 | 31.45 | 38.49 | 41.50 |
+ | | KMMLU-hard | **29.46** (1st) | 23.17 | 20.78 | 19.25 | 24.78 | 20.49 | 17.86 | 19.83 | 20.61 |
+ | | KoBEST | **83.40** (1st) | 79.05 | 80.13 | <u>81.67</u> | 78.51 | 67.56 | 63.77 | 72.99 | 73.26 |
+ | | Belebele | **57.99** (1st) | | 45.11 | 49.40 | 54.85 | 54.70 | 40.31 | 53.17 | 48.68 |
+ | | CSATQA | **43.32** (1st) | 40.11 | 34.76 | 39.57 | 45.45 | 36.90 | 27.27 | 32.62 | 34.22 |
+ | English | MMLU | 66.59 (3rd) | 65.27 | 64.32 | 63.63 | **74.26** | <u>68.26</u> | 62.04 | 63.25 | 65.30 |
+ | | MMLU-Pro | **43.05** (1st) | | 38.90 | 32.79 | <u>42.50</u> | 40.92 | 33.49 | 37.11 | 30.25 |
+ | | GSM8K | **80.52** (1st) | 65.96 | <u>80.06</u> | 56.18 | 75.74 | 75.82 | 49.66 | 64.14 | 69.22 |
 
  - The *highest* *scores* are in **bold** form, and the *second*\-*highest* *scores* are <u>underlined</u>.
- - These results were obtained using a 5-shot evaluation setting.

+ **Evaluation Protocol**
+ For easy reproduction of our evaluation results, we list the evaluation tools and settings used below:
+
+ | Benchmark | Evaluation setting | Metric | Evaluation tool |
+ |------------|--------------------|-------------------------------------|-----------------|
+ | KMMLU | 5-shot | mean / exact\_match | lm-eval-harness |
+ | KMMLU Hard | 5-shot | mean / exact\_match | lm-eval-harness |
+ | KoBEST | 5-shot | macro\_avg / f1 | lm-eval-harness |
+ | Belebele | 0-shot | mean / acc | lm-eval-harness |
+ | CSATQA | 0-shot | mean / acc\_norm | lm-eval-harness |
+ | MMLU | 5-shot | mean / acc | lm-eval-harness |
+ | MMLU Pro | 5-shot | mean / exact\_match | lm-eval-harness |
+ | GSM8K | 5-shot | acc, exact\_match & strict\_extract | lm-eval-harness |
 
  ## Quickstart
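
The Evaluation Protocol table added in this commit points to lm-eval-harness with per-benchmark few-shot settings. As a rough sketch only (not part of this commit), the 5-shot tasks could be run through the harness's Python API along the following lines; the model repo id, task identifiers, and batch size are assumptions and may not match the authors' exact configuration or the task names registered in your lm-evaluation-harness version.

```python
# Hedged sketch (not from this commit): approximating the 5-shot settings in the
# Evaluation Protocol table with EleutherAI's lm-evaluation-harness (pip install lm-eval).
# The repo id, task identifiers, and batch size are assumptions; check the task
# registry of your harness version before relying on the results.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                  # Hugging Face transformers backend
    model_args="pretrained=dnotitia/DNA-1.0-8B-Instruct,dtype=bfloat16",
    tasks=["kmmlu", "kobest", "mmlu", "gsm8k"],  # 5-shot benchmarks from the table
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])                        # per-task metric dictionaries

# Belebele and CSATQA are listed as 0-shot above, so they would be evaluated in a
# separate call with num_fewshot=0 (and their own task identifiers).
```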