Update README.md
README.md CHANGED
@@ -16,6 +16,7 @@ tags:
 ---
 **Update**
 - 01.03.2024 - Reuploaded the model in bfloat16 dtype.
+- 02.03.2024 - **Strongest Gemma fine-tune so far**: added AGIEval, GPT4All and BigBench results
 
 ![SauerkrautLM](https://vago-solutions.de/wp-content/uploads/2024/02/sauerkrautgemma.jpeg "SauerkrautLM-Gemma-7b")
 ## VAGO solutions SauerkrautLM-Gemma-7b (alpha)
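Since the 01.03.2024 update note says the checkpoint was reuploaded in bfloat16, a minimal loading sketch with 🤗 Transformers could look like the following. The repo id is taken from the comparison table further down; `device_map="auto"` and the example prompt are illustrative assumptions, not part of the card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id as listed in the model card's comparison table.
model_id = "VAGOsolutions/SauerkrautLM-Gemma-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the dtype of the 01.03.2024 reupload
    device_map="auto",           # assumption: automatic placement (requires accelerate)
)

# Quick generation smoke test; the prompt is illustrative only.
inputs = tokenizer("What is sauerkraut?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```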
@@ -105,6 +106,93 @@
 | Winogrande (5-shot) | 76.64 |
 | GSM8K (5-shot) | 63.68 |
 
+**Performance**
+
+| Model |AGIEval|GPT4All|TruthfulQA|BigBench|Average ⬇️|
+|-----------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
+|[VAGOsolutions/SauerkrautLM-Gemma-7b](https://huggingface.co/VAGOsolutions/SauerkrautLM-Gemma-7b) | 37.5| 72.46| 61.24| 45.33| 54.13|
+|[zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 37.52| 71.77| 55.26| 39.77| 51.08|
+|[zephyr-7b-gemma-v0.1](https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1)| 34.22| 66.37| 52.19| 37.10| 47.47|
+|[google/gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 21.33| 40.84| 41.70| 30.25| 33.53|
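The Average ⬇️ column reads as the plain mean of the four benchmark scores. As a quick sanity check, here is that arithmetic for the SauerkrautLM-Gemma-7b row (values copied from the table above; the averaging rule itself is an interpretation, not stated in the card):

```python
# Mean of the four benchmark scores for SauerkrautLM-Gemma-7b (values from the table above).
scores = {"AGIEval": 37.5, "GPT4All": 72.46, "TruthfulQA": 61.24, "BigBench": 45.33}
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # -> 54.13, matching the Average column
```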
+
+
+<details><summary>Details of AGIEval, GPT4All, TruthfulQA, BigBench</summary>
+
+**AGIEval**
+| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
+|------------------------------|------:|------|------|--------|-----:|---|-----:|
+|agieval_sat_math | 1|none |None |acc |0.3682|± |0.0326|
+| | |none |None |acc_norm|0.3364|± |0.0319|
+|agieval_sat_en_without_passage| 1|none |None |acc |0.4272|± |0.0345|
+| | |none |None |acc_norm|0.3738|± |0.0338|
+|agieval_sat_en | 1|none |None |acc |0.7427|± |0.0305|
+| | |none |None |acc_norm|0.6893|± |0.0323|
+|agieval_lsat_rc | 1|none |None |acc |0.5539|± |0.0304|
+| | |none |None |acc_norm|0.5167|± |0.0305|
+|agieval_lsat_lr | 1|none |None |acc |0.3431|± |0.0210|
+| | |none |None |acc_norm|0.3471|± |0.0211|
+|agieval_lsat_ar | 1|none |None |acc |0.1913|± |0.0260|
+| | |none |None |acc_norm|0.1739|± |0.0250|
+|agieval_logiqa_en | 1|none |None |acc |0.3303|± |0.0184|
+| | |none |None |acc_norm|0.3303|± |0.0184|
+|agieval_aqua_rat | 1|none |None |acc |0.2480|± |0.0272|
+| | |none |None |acc_norm|0.2323|± |0.0265|
+
+Average: 37.5%
+
+**GPT4All**
+| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
+|---------|------:|------|------|--------|-----:|---|-----:|
+|arc_challenge| 1|none |None |acc |0.5358|± |0.0146|
+| | |none |None |acc_norm|0.5597|± |0.0145|
+|arc_easy | 1|none |None |acc |0.8249|± |0.0078|
+| | |none |None |acc_norm|0.7955|± |0.0083|
+|boolq | 2|none |None |acc |0.8651|± |0.006 |
+|hellaswag | 1|none |None |acc |0.6162|± |0.0049|
+| | |none |None |acc_norm|0.8117|± |0.0039|
+|openbookqa | 1|none |None |acc |0.336|± |0.0211|
+| | |none |None |acc_norm|0.470|± |0.0223|
+|piqa | 1|none |None |acc |0.7900|± |0.0095|
+| | |none |None |acc_norm|0.8096|± |0.00 |
+|winogrande | 1|none |None |acc |0.7609|± |0.012 |
+
+Average: 72.46%
+
+**TruthfulQA**
+| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
+|--------------|------:|------|-----:|------|-----:|---|-----:|
+|truthfulqa_mc2| 2|none | 0|acc |0.6124|± |0.0148|
+
+Average: 61.24%
+
+**Bigbench**
+| Tasks |Version| Filter |n-shot| Metric |Value | |Stderr|
+|----------------------------------------------------|------:|----------------|-----:|-----------|-----:|---|-----:|
+|bbh_zeroshot_tracking_shuffled_objects_three_objects| 2|flexible-extract| 0|exact_match|0.2760|± |0.0283|
+|bbh_zeroshot_tracking_shuffled_objects_seven_objects| 2|flexible-extract| 0|exact_match|0.1280|± |0.0212|
+|bbh_zeroshot_tracking_shuffled_objects_five_objects | 2|flexible-extract| 0|exact_match|0.1240|± |0.0209|
+|bbh_zeroshot_temporal_sequences | 2|flexible-extract| 0|exact_match|0.4520|± |0.0315|
+|bbh_zeroshot_sports_understanding | 2|flexible-extract| 0|exact_match|0.7120|± |0.0287|
+|bbh_zeroshot_snarks | 2|flexible-extract| 0|exact_match|0.5056|± |0.0376|
+|bbh_zeroshot_salient_translation_error_detection | 2|flexible-extract| 0|exact_match|0.4480|± |0.0315|
+|bbh_zeroshot_ruin_names | 2|flexible-extract| 0|exact_match|0.4520|± |0.0315|
+|bbh_zeroshot_reasoning_about_colored_objects | 2|flexible-extract| 0|exact_match|0.4800|± |0.0317|
+|bbh_zeroshot_navigate | 2|flexible-extract| 0|exact_match|0.5480|± |0.0315|
+|bbh_zeroshot_movie_recommendation | 2|flexible-extract| 0|exact_match|0.7000|± |0.0290|
+|bbh_zeroshot_logical_deduction_three_objects | 2|flexible-extract| 0|exact_match|0.5200|± |0.0317|
+|bbh_zeroshot_logical_deduction_seven_objects | 2|flexible-extract| 0|exact_match|0.4120|± |0.0312|
+|bbh_zeroshot_logical_deduction_five_objects | 2|flexible-extract| 0|exact_match|0.3840|± |0.0308|
+|bbh_zeroshot_geometric_shapes | 2|flexible-extract| 0|exact_match|0.2920|± |0.0288|
+|bbh_zeroshot_disambiguation_qa | 2|flexible-extract| 0|exact_match|0.6480|± |0.0303|
+|bbh_zeroshot_date_understanding | 2|flexible-extract| 0|exact_match|0.5000|± |0.0317|
+|bbh_zeroshot_causal_judgement | 2|flexible-extract| 0|exact_match|0.5775|± |0.0362|
+
+Average: 45.33%
+
+</details>
+
+
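The detail tables above follow the Version/Filter/n-shot/Stderr layout typical of EleutherAI's lm-evaluation-harness output. Assuming that harness was used (the card does not state it explicitly), a reproduction sketch via its Python API might look like this; the task selection, batch size and dtype are assumptions:

```python
# Sketch only: assumes EleutherAI's lm-evaluation-harness (pip install lm-eval),
# which the table layout suggests but the card does not explicitly confirm.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=VAGOsolutions/SauerkrautLM-Gemma-7b,dtype=bfloat16",
    tasks=[  # GPT4All-style subset taken from the tables above
        "arc_challenge", "arc_easy", "boolq", "hellaswag",
        "openbookqa", "piqa", "winogrande", "truthfulqa_mc2",
    ],
    num_fewshot=0,   # the tables report zero-shot / unspecified n-shot
    batch_size=8,    # assumption
)

for task, metrics in results["results"].items():
    print(task, metrics)
```

Exact numbers would only be expected to match with the same harness version and generation settings as the original runs.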
 Although we achieved great results on the Open LLM Leaderboard benchmarks, the model subjectively does not feel as smart as comparable Mistral fine-tunes. Most of its answers are coherent, but we observed that it sometimes gives really lazy or odd answers.
 
 ## Disclaimer