}
```
## Standard evaluations

cosmosage can be compared to OpenHermes-2.5-Mistral-7B using standard evaluation metrics.

| Test Category  | cosmosage_v2 | OpenHermes-2.5-Mistral-7B |
|----------------|--------------|---------------------------|
| Overall        | 0.595        | 0.632                     |
| ARC Challenge  | 0.565        | 0.613                     |
| Hellaswag      | 0.619        | 0.652                     |
| TruthfulQA:mc1 | 0.348        | 0.361                     |
| TruthfulQA:mc2 | 0.510        | 0.522                     |
| Winogrande     | 0.759        | 0.781                     |
| GSM8k          | 0.368        | 0.261                     |

cosmosage scores only slightly below OpenHermes-2.5-Mistral-7B on most metrics, indicating that the heavy specialization in cosmology has left its general-purpose abilities nearly unchanged. The exception is GSM8k, a collection of grade-school math problems, where cosmosage scores noticeably higher.
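As an illustration of the per-benchmark gap, the following sketch tabulates the score differences directly from the numbers above (the values are copied from the table, not recomputed by re-running the benchmarks):

```python
# Scores copied from the table above: (cosmosage_v2, OpenHermes-2.5-Mistral-7B).
scores = {
    "ARC Challenge":  (0.565, 0.613),
    "Hellaswag":      (0.619, 0.652),
    "TruthfulQA:mc1": (0.348, 0.361),
    "TruthfulQA:mc2": (0.510, 0.522),
    "Winogrande":     (0.759, 0.781),
    "GSM8k":          (0.368, 0.261),
}

# Print the delta (cosmosage minus OpenHermes) for each benchmark.
for task, (cosmo, hermes) in scores.items():
    print(f"{task:15s} {cosmo - hermes:+.3f}")
# GSM8k is the only benchmark with a positive delta (+0.107);
# on every other benchmark cosmosage trails by 0.05 or less.
```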
## Example output

**User:**