---
datasets:
- EleutherAI/the_pile_deduplicated
language:
- en
---

# Broken because of updates to the transformers library; I'll reimplement and retrain

GLORT2 (GLORT2 Low Rank Transformer Transformer) is a transformer model in which every linear layer is replaced by another, smaller transformer model. The Q, K, and V projections are fused into a single operation, so one inner transformer stands in for three, which saves parameters. I also experimented with applying a transformer to the embeddings, but the results weren't great. The main model is 768-dimensional with 10 layers; each replacement for a linear layer (everything except the embeddings and LM head) is a 384-dimensional, 1-layer transformer.
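To make the idea above concrete, here is a minimal PyTorch sketch of "a linear layer replaced by a smaller transformer," including the fused Q/K/V trick. The module names, the in/out projections around the inner transformer, and the head count are my own assumptions for illustration, not the actual GLORT2 implementation.

```python
import torch
import torch.nn as nn

class TransformerLinear(nn.Module):
    """Hypothetical sketch: stand in for a d_in -> d_out nn.Linear with a
    small 1-layer, 384-dim transformer. The projections into and out of the
    inner width are assumptions; the real model may wire this differently."""

    def __init__(self, d_in: int, d_out: int, inner_dim: int = 384, nhead: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(d_in, inner_dim)    # map features to inner width
        layer = nn.TransformerEncoderLayer(
            d_model=inner_dim, nhead=nhead,
            dim_feedforward=4 * inner_dim, batch_first=True)
        self.inner = nn.TransformerEncoder(layer, num_layers=1)
        self.proj_out = nn.Linear(inner_dim, d_out)  # map back to target width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in) -> (batch, seq, d_out)
        return self.proj_out(self.inner(self.proj_in(x)))


class FusedQKV(nn.Module):
    """Fused Q/K/V: one inner transformer emits all three projections at
    once, i.e. one TransformerLinear instead of three."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        self.qkv = TransformerLinear(d_model, 3 * d_model)

    def forward(self, x: torch.Tensor):
        q, k, v = self.qkv(x).chunk(3, dim=-1)  # split the fused output
        return q, k, v
```

The fusion saves parameters because the inner transformer's cost is dominated by its own 384-dim layers, which are shared across the Q, K, and V outputs; only the output projection grows.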

Also, apologies: I just realized the model code I copied over from my own projects still contains some leftover "expanded lm head size" logic. If you're reading the config and code, just ignore it; this isn't a serious project, so I haven't bothered to remove it.

| model | 512-token strided perplexity on a Pile test set | training tokens |
| --- | --- | --- |
| cerebras 111m | 21.55 | 2.2b |
| cerebras 256m | 15.20 | 5.1b |
| cerebras 590m | 12.10 | 11.something b |
| deduped pythia 70m (95.6M) | 22.39 | 300b |
| deduped pythia 160m (213M) | 13.93 | 300b |
| deduped pythia 410m (506M) | 9.62 | 300b |
| llama w/ same settings as cerebras 111m (119m) | 13.88 | 2.2b |
| llama plus w/ same settings as cerebras 111m and llama 70b embeddings (369m) | 13.57 | 2.2b |
| **GLORT2 (205m)** | 13.05 | 2.2b |
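For reference, the "512-token strided perplexity" in the table above can be sketched as follows. The `score_window` callback is a hypothetical placeholder for a model forward pass, and the exact window/stride settings used for the table are assumptions beyond the 512-token window it names.

```python
import math

def strided_perplexity(score_window, n_tokens, window=512, stride=512):
    """Slide a window of `window` tokens over the sequence; each window
    scores only the tokens not already scored by the previous one.
    `score_window(begin, end, n_target)` is a hypothetical callback
    returning the summed negative log-likelihood of the last `n_target`
    tokens of the span [begin, end) -- in practice, one model forward pass.
    """
    total_nll, counted, prev_end = 0.0, 0, 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        n_target = end - prev_end            # tokens new to this window
        total_nll += score_window(begin, end, n_target)
        counted += n_target
        prev_end = end
        if end == n_tokens:
            break
    return math.exp(total_nll / counted)

# Sanity check: a model assigning every token probability 1/2 has perplexity 2.
uniform = lambda begin, end, n_target: n_target * math.log(2)
print(round(strided_perplexity(uniform, n_tokens=1000), 6))  # -> 2.0
```

With `stride < window` each forward pass sees extra left context but only the fresh tokens count toward the total, which is the usual way to avoid double-scoring overlapping spans.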


|     Tasks     |Version|  Filter  |n-shot|   Metric  | Value|   |Stderr|
|---------------|------:|----------|-----:|-----------|-----:|---|-----:|
|arc_challenge  |      1|none      |    25|acc        |0.1706|±  |0.0110|
|               |       |none      |    25|acc_norm   |0.2099|±  |0.0119|
|truthfulqa_mc2 |      2|none      |     0|acc        |0.4599|±  |0.0154|
|winogrande     |      1|none      |     5|acc        |0.5083|±  |0.0141|
|hellaswag      |      1|none      |    10|acc        |0.2728|±  |0.0044|
|               |       |none      |    10|acc_norm   |0.2815|±  |0.0045|
|gsm8k          |      2|get-answer|     5|exact_match|0.0000|±  |0.0000|


### MMLU

The mean accuracy across tasks is roughly 0.264.

|               Tasks               |Version|Filter|n-shot|Metric|Value |   |Stderr|
|-----------------------------------|------:|------|-----:|------|-----:|---|-----:|
|world_religions                    |      0|none  |     5|acc   |0.1988|±  |0.0306|
|virology                           |      0|none  |     5|acc   |0.1928|±  |0.0307|
|us_foreign_policy                  |      0|none  |     5|acc   |0.2600|±  |0.0441|
|sociology                          |      0|none  |     5|acc   |0.2438|±  |0.0304|
|security_studies                   |      0|none  |     5|acc   |0.4000|±  |0.0314|
|public_relations                   |      0|none  |     5|acc   |0.2273|±  |0.0401|
|professional_psychology            |      0|none  |     5|acc   |0.2484|±  |0.0175|
|professional_medicine              |      0|none  |     5|acc   |0.4485|±  |0.0302|
|professional_law                   |      0|none  |     5|acc   |0.2445|±  |0.0110|
|professional_accounting            |      0|none  |     5|acc   |0.2482|±  |0.0258|
|prehistory                         |      0|none  |     5|acc   |0.2562|±  |0.0243|
|philosophy                         |      0|none  |     5|acc   |0.2186|±  |0.0235|
|nutrition                          |      0|none  |     5|acc   |0.2941|±  |0.0261|
|moral_scenarios                    |      0|none  |     5|acc   |0.2503|±  |0.0145|
|moral_disputes                     |      0|none  |     5|acc   |0.1965|±  |0.0214|
|miscellaneous                      |      0|none  |     5|acc   |0.2554|±  |0.0156|
|medical_genetics                   |      0|none  |     5|acc   |0.3000|±  |0.0461|
|marketing                          |      0|none  |     5|acc   |0.1966|±  |0.0260|
|management                         |      0|none  |     5|acc   |0.1942|±  |0.0392|
|machine_learning                   |      0|none  |     5|acc   |0.2321|±  |0.0401|
|logical_fallacies                  |      0|none  |     5|acc   |0.2331|±  |0.0332|
|jurisprudence                      |      0|none  |     5|acc   |0.2407|±  |0.0413|
|international_law                  |      0|none  |     5|acc   |0.3719|±  |0.0441|
|human_sexuality                    |      0|none  |     5|acc   |0.2137|±  |0.0360|
|human_aging                        |      0|none  |     5|acc   |0.2646|±  |0.0296|
|high_school_world_history          |      0|none  |     5|acc   |0.2489|±  |0.0281|
|high_school_us_history             |      0|none  |     5|acc   |0.2304|±  |0.0296|
|high_school_statistics             |      0|none  |     5|acc   |0.4722|±  |0.0340|
|high_school_psychology             |      0|none  |     5|acc   |0.3083|±  |0.0198|
|high_school_physics                |      0|none  |     5|acc   |0.3046|±  |0.0376|
|high_school_microeconomics         |      0|none  |     5|acc   |0.3361|±  |0.0307|
|high_school_mathematics            |      0|none  |     5|acc   |0.2630|±  |0.0268|
|high_school_macroeconomics         |      0|none  |     5|acc   |0.3231|±  |0.0237|
|high_school_government_and_politics|      0|none  |     5|acc   |0.3523|±  |0.0345|
|high_school_geography              |      0|none  |     5|acc   |0.3384|±  |0.0337|
|high_school_european_history       |      0|none  |     5|acc   |0.2909|±  |0.0355|
|high_school_computer_science       |      0|none  |     5|acc   |0.2600|±  |0.0441|
|high_school_chemistry              |      0|none  |     5|acc   |0.2709|±  |0.0313|
|high_school_biology                |      0|none  |     5|acc   |0.3161|±  |0.0265|
|global_facts                       |      0|none  |     5|acc   |0.1800|±  |0.0386|
|formal_logic                       |      0|none  |     5|acc   |0.1667|±  |0.0333|
|elementary_mathematics             |      0|none  |     5|acc   |0.2540|±  |0.0224|
|electrical_engineering             |      0|none  |     5|acc   |0.3103|±  |0.0386|
|econometrics                       |      0|none  |     5|acc   |0.2895|±  |0.0427|
|conceptual_physics                 |      0|none  |     5|acc   |0.2553|±  |0.0285|
|computer_security                  |      0|none  |     5|acc   |0.1900|±  |0.0394|
|college_physics                    |      0|none  |     5|acc   |0.3431|±  |0.0472|
|college_medicine                   |      0|none  |     5|acc   |0.2312|±  |0.0321|
|college_mathematics                |      0|none  |     5|acc   |0.1800|±  |0.0386|
|college_computer_science           |      0|none  |     5|acc   |0.3000|±  |0.0461|
|college_chemistry                  |      0|none  |     5|acc   |0.2900|±  |0.0456|
|college_biology                    |      0|none  |     5|acc   |0.2083|±  |0.0340|
|clinical_knowledge                 |      0|none  |     5|acc   |0.2038|±  |0.0248|
|business_ethics                    |      0|none  |     5|acc   |0.2100|±  |0.0409|
|astronomy                          |      0|none  |     5|acc   |0.1908|±  |0.0320|
|anatomy                            |      0|none  |     5|acc   |0.2963|±  |0.0394|
|abstract_algebra                   |      0|none  |     5|acc   |0.2000|±  |0.0402|