A modified GPT-2 model with only 25 million non-embedding parameters that outbenches GPT-2 (124M), Pythia-70m/160m, and Cerebras-111m. It uses ScaledSinusoidal position embeddings, embedding layernorm, and no biases, and was trained on only 8 billion tokens of the SlimPajama dataset at home on 2×A6000s. (On the graphic it is mislabeled as cramp-41m.)

| model | avg | arc | hellaswag | mmlu | truthfulqa |
| --- | --- | --- | --- | --- | --- |