Update Evaluation contents (#1)
- Update Evaluation contents (076a14ad204356d766da58ce9aeefb3eadec5e0c)
Co-authored-by: Taekyoon Ted Choi <Taekyoon@users.noreply.huggingface.co>
README.md CHANGED
@@ -101,6 +101,34 @@ Training was done using [beomi/Gemma-EasyLM](https://github.com/Beomi/Gemma-EasyLM)
Model evaluation metrics and results.

### Benchmark Results

| Category | Metric | Shots | 7b |
@@ -117,24 +145,24 @@ Model evaluation metrics and results.
| | Hellaswag (acc-norm) | | 63.2 |
| | Sentineg | | 97.98 |
| | WiC | | 70.95 |
-| **JP Eval Harness (Prompt ver 0.3)** | JcommonsenseQA | 3-shot | 85.97 |
-| | JNLI | 3-shot | 39.11 |
-| | Marc_ja | 3-shot | 96.48 |
-| | JSquad | 2-shot | 70.69 |
-| | Jaqket | 1-shot | 81.53 |
-| | MGSM | 5-shot | 28.8 |
-| **XWinograd (5-shot)** | EN | | 90.71 |
-| | FR | | 80.72 |
-| | JP | | 84.15 |
-| | PT | | 80.99 |
-| | RU | | 76.51 |
-| | ZH | | 76.98 |
| **XCOPA (5-shot)** | IT | | 72.8 |
| | ID | | 76.4 |
| | TH | | 60.2 |
| | TR | | 65.6 |
| | VI | | 77.2 |
| | ZH | | 80.2 |
Model evaluation metrics and results.
+### Evaluation Scripts
+
+- For Knowledge / KoBest / XCOPA / XWinograd
+  - [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.2
+```bash
+!git clone https://github.com/EleutherAI/lm-evaluation-harness.git
+!cd lm-evaluation-harness && pip install -r requirements.txt && pip install -e .
+
+!lm_eval --model hf \
+    --model_args pretrained=beomi/gemma-mling-7b,dtype="float16" \
+    --tasks "haerae,kobest,kmmlu_direct,cmmlu,ceval-valid,mmlu,xwinograd,xcopa" \
+    --num_fewshot "0,5,5,5,5,5,0,5" \
+    --device cuda
+```
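For longer runs it can help to save the metrics rather than only printing them. The sketch below is a hedged variant of the command above: it assumes lm-evaluation-harness v0.4.2's `--batch_size auto` and `--output_path` options and an illustrative `results/` output directory, so verify both flags with `lm_eval --help` on the installed version.

```bash
# Minimal sketch: evaluate a subset of the tasks and write the scores to JSON.
# --batch_size auto and --output_path are assumed to be available in v0.4.2;
# confirm with `lm_eval --help` before relying on them.
lm_eval --model hf \
    --model_args pretrained=beomi/gemma-mling-7b,dtype="float16" \
    --tasks "kobest,xcopa" \
    --num_fewshot 5 \
    --batch_size auto \
    --device cuda \
    --output_path results/gemma-mling-7b
```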
+- For JP Eval Harness
+  - [Stability-AI/lm-evaluation-harness (`jp-stable` branch)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable)
+```bash
+!git clone -b jp-stable https://github.com/Stability-AI/lm-evaluation-harness.git
+!cd lm-evaluation-harness && pip install -e ".[ja]"
+!pip install 'fugashi[unidic]' && python -m unidic download
+
+!cd lm-evaluation-harness && python main.py \
+    --model hf-causal \
+    --model_args pretrained=beomi/gemma-mling-7b,torch_dtype='auto' \
+    --tasks "jcommonsenseqa-1.1-0.3,jnli-1.3-0.3,marc_ja-1.1-0.3,jsquad-1.1-0.3,jaqket_v2-0.2-0.3,xlsum_ja,mgsm" \
+    --num_fewshot "3,3,3,2,1,1,5"
+```
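The `jp-stable` branch keeps the older harness CLI, which also accepts `--device` and, in most revisions, `--output_path`. The sketch below assumes those two flags plus an illustrative output filename; double-check `python main.py --help` on the checkout before using it.

```bash
# Minimal sketch: run two of the Japanese tasks on one GPU and save the results.
# --device and --output_path are assumed from the older harness CLI; verify with
# `python main.py --help` on the jp-stable checkout.
cd lm-evaluation-harness && python main.py \
    --model hf-causal \
    --model_args pretrained=beomi/gemma-mling-7b,torch_dtype='auto' \
    --tasks "jcommonsenseqa-1.1-0.3,jsquad-1.1-0.3" \
    --num_fewshot "3,2" \
    --device cuda:0 \
    --output_path results/gemma-mling-7b-ja.json
```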
### Benchmark Results

| Category | Metric | Shots | 7b |
| | Hellaswag (acc-norm) | | 63.2 |
| | Sentineg | | 97.98 |
| | WiC | | 70.95 |
| **XCOPA (5-shot)** | IT | | 72.8 |
| | ID | | 76.4 |
| | TH | | 60.2 |
| | TR | | 65.6 |
| | VI | | 77.2 |
| | ZH | | 80.2 |
+| **JP Eval Harness (Prompt ver 0.3)** | JcommonsenseQA | 3-shot | 85.97 |
+| | JNLI | 3-shot | 39.11 |
+| | Marc_ja | 3-shot | 96.48 |
+| | JSquad | 2-shot | 70.69 |
+| | Jaqket | 1-shot | 81.53 |
+| | MGSM | 5-shot | 28.8 |
+| **XWinograd (0-shot)** | EN | | 89.03 |
+| | FR | | 72.29 |
+| | JP | | 82.69 |
+| | PT | | 73.38 |
+| | RU | | 68.57 |
+| | ZH | | 79.17 |