ChuckMcSneed committed
Commit 195a19c
Parent(s): cb2b48a
Update README.md

README.md CHANGED
@@ -180,14 +180,6 @@ Then I SLERP-merged it with cognitivecomputations/dolphin-2.2-70b (Needed to bri
 
 Absurdly high. That's what happens when you optimize the merges for a benchmark.
 
-### Open LLM leaderboard
-[Leaderboard on Huggingface](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
-|Model                           |Average|ARC  |HellaSwag|MMLU |TruthfulQA|Winogrande|GSM8K|
-|--------------------------------|-------|-----|---------|-----|----------|----------|-----|
-|ChuckMcSneed/Gembo-v1-70b       |70.51  |71.25|86.98    |70.85|63.25     |80.51     |50.19|
-|ChuckMcSneed/SMaxxxer-v1-70b    |72.23  |70.65|88.02    |70.55|60.7      |82.87     |60.58|
-
-Looks like adding a shitton of RP stuff decreased HellaSwag, WinoGrande and GSM8K, but increased TruthfulQA, MMLU and ARC. Interesting. To be honest, I'm a bit surprised that it didn't do that much worse.
 
 ### WolframRavenwolf
 Benchmark by [@wolfram](https://huggingface.co/wolfram)
@@ -198,7 +190,16 @@ Artefact2/Gembo-v1-70b-GGUF GGUF Q5_K_M, 4K context, Alpaca format:
 - ➖ Did NOT follow instructions to answer with just a single letter or more than just a single letter.
 
 This shows that this model can be used for real world use cases as an assistant.
-
+
+### [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+[Leaderboard on Huggingface](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+|Model                           |Average|ARC  |HellaSwag|MMLU |TruthfulQA|Winogrande|GSM8K|
+|--------------------------------|-------|-----|---------|-----|----------|----------|-----|
+|ChuckMcSneed/Gembo-v1-70b       |70.51  |71.25|86.98    |70.85|63.25     |80.51     |50.19|
+|ChuckMcSneed/SMaxxxer-v1-70b    |72.23  |70.65|88.02    |70.55|60.7      |82.87     |60.58|
+
+Looks like adding a shitton of RP stuff decreased HellaSwag, WinoGrande and GSM8K, but increased TruthfulQA, MMLU and ARC. Interesting. To be honest, I'm a bit surprised that it didn't do that much worse.
+
 Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_ChuckMcSneed__Gembo-v1-70b)
 
 | Metric |Value|
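For reference, the Open LLM Leaderboard's Average column is the arithmetic mean of the six benchmark scores (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K). A quick sketch to verify the figures in the table above, within two-decimal rounding:

```python
# Sanity check: the reported Average should match the mean of the six
# benchmark scores, up to rounding to two decimal places.
scores = {
    "ChuckMcSneed/Gembo-v1-70b":    [71.25, 86.98, 70.85, 63.25, 80.51, 50.19],
    "ChuckMcSneed/SMaxxxer-v1-70b": [70.65, 88.02, 70.55, 60.7, 82.87, 60.58],
}
reported = {
    "ChuckMcSneed/Gembo-v1-70b": 70.51,
    "ChuckMcSneed/SMaxxxer-v1-70b": 72.23,
}

for model, vals in scores.items():
    avg = sum(vals) / len(vals)
    # Allow for the two-decimal rounding of the reported value.
    assert abs(avg - reported[model]) < 0.01, (model, avg)
    print(f"{model}: mean of six scores ≈ {avg:.3f} (reported {reported[model]})")
```

Both rows check out, so the Average column is consistent with the per-benchmark scores.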