lapp0 committed on
Commit
3c993aa
1 Parent(s): 8b81053

End of training

Files changed (12)
  1. README.md +17 -38
  2. benchmarks.shelve.bak +1 -0
  3. benchmarks.shelve.dat +0 -0
  4. benchmarks.shelve.dir +1 -0
  5. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
  6. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
  7. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
  8. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
  9. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333565.1c1a426a2fee +3 -0
  10. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
  11. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee +3 -0
  12. tokenizer.json +2 -14
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 base_model: HuggingFaceTB/SmolLM-135M
 datasets:
-- HuggingFaceFW/fineweb
+- HuggingFaceFW/fineweb-edu
 library_name: Distily
 license: creativeml-openrail-m
 tags:
@@ -18,7 +18,7 @@ model-index:
 
 Distilled with [Distily](https://github.com/lapp0/distily) library
 using teacher model [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M)
-on dataset [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
+on dataset [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -80,20 +80,21 @@ LlamaForCausalLM(
 - student 2: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8`
 - student 3: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8`
 - student 4: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8`
-
-| Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 |
-| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
-| tinyArc.acc_norm,none | 0.37 | **0.303** | 0.295 | 0.302 | 0.26 | 0.269 |
-| tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 |
-| tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
-| tinyHellaswag.acc_norm,none | 0.452 | **0.341** | 0.281 | 0.327 | 0.3 | 0.303 |
-| tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | **0.31** | 0.286 | 0.279 |
-| tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 |
-| tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 |
+- student 5: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8`
+
+| Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 | student 5 |
+| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| tinyArc.acc_norm,none | 0.37 | 0.303 | 0.295 | 0.302 | 0.26 | 0.269 | **0.319** |
+| tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 | 0.012 |
+| tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
+| tinyHellaswag.acc_norm,none | 0.452 | **0.341** | 0.281 | 0.327 | 0.3 | 0.303 | 0.301 |
+| tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | **0.31** | 0.286 | 0.279 | 0.292 |
+| tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 | 0.427 |
+| tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 | 0.417 |
 
 # Resource Usage
 
-- Max Train VRAM Use: 13.1269 GB
+- Max Train VRAM Use: 13.1273 GB
 - Available VRAM: 23.4329 GB
 - GPUs:
   - 1x NVIDIA GeForce RTX 4090
@@ -123,28 +124,6 @@ LlamaForCausalLM(
 (self_attn): LlamaSdpaAttention(
 (q_proj): Linear(in_features=576, out_features=576, bias=False)
 (k_proj): Linear(in_features=576, out_features=192, bias=False)
-@@ -10,17 +10,16 @@
- (o_proj): Linear(in_features=576, out_features=576, bias=False)
- (rotary_emb): LlamaRotaryEmbedding()
- )
-- (mlp): LlamaMLP(
-+ (mlp): LigerSwiGLUMLP(
- (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
- (up_proj): Linear(in_features=576, out_features=1536, bias=False)
- (down_proj): Linear(in_features=1536, out_features=576, bias=False)
-- (act_fn): SiLU()
- )
-- (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
-- (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
-+ (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
-+ (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
- )
- )
-- (norm): LlamaRMSNorm((576,), eps=1e-05)
-+ (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
- (rotary_emb): LlamaRotaryEmbedding()
- )
- (lm_head): Linear(in_features=576, out_features=49152, bias=False)
 
 ```
@@ -152,7 +131,7 @@ LlamaForCausalLM(
 <br/>
 
 # Train Dataset
-Trained on 501,158,307 tokens from the [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset.
+Trained on 640,425,804 tokens from the [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset.
 
 - Num Samples: `998,000`
 - Subset: `sample-10BT`
@@ -202,7 +181,7 @@ The following hyperparameters were used during training:
 weight=0
 )
 )`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7d820438ae60>`
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7d824cbaf4f0>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `{'num_hidden_layers': 15}`
@@ -213,7 +192,7 @@ The following hyperparameters were used during training:
 - teacher_model_name_or_path: `HuggingFaceTB/SmolLM-135M`
 - teacher_load_in_8bit: `False`
 - teacher_load_in_4bit: `False`
-- dataset_uri: `HuggingFaceFW/fineweb`
+- dataset_uri: `HuggingFaceFW/fineweb-edu`
 - dataset_subset: `sample-10BT`
 - dataset_split: `train`
 - dataset_column_name: `text`
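
Note on the architecture dump removed above: the embedded diff showed the student's `LlamaMLP` and `LlamaRMSNorm` modules replaced by `LigerSwiGLUMLP` and `LigerRMSNorm`, i.e. the model was patched with Liger kernels. Below is a minimal sketch of how such a patch is typically applied, assuming the `liger-kernel` package's Llama helper; Distily's actual wiring is not shown in this commit and may differ.

```python
# Sketch only: patch Llama-architecture models with Liger kernels so their
# MLP and norm layers report as LigerSwiGLUMLP / LigerRMSNorm, matching the
# module names in the README's architecture dump. Assumes `liger-kernel`.
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

# Monkey-patches transformers' Llama classes in place; must run before the
# model is instantiated.
apply_liger_kernel_to_llama(rope=True, swiglu=True, rms_norm=True)

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
print(model)  # MLP / RMSNorm layers now print as Liger modules
```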
benchmarks.shelve.bak CHANGED
@@ -4,3 +4,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
+'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
benchmarks.shelve.dat CHANGED
Binary files a/benchmarks.shelve.dat and b/benchmarks.shelve.dat differ
 
benchmarks.shelve.dir CHANGED
@@ -4,3 +4,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
+'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
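
The `benchmarks.shelve.bak` / `.dat` / `.dir` trio is the on-disk layout of a Python `shelve` store backed by `dbm.dumb`: `.dat` holds the pickled values, while `.dir` (and its `.bak` backup) is the index mapping each key to an `(offset, length)` pair inside `.dat`. The `(3072, 448)` above therefore locates the new entry in the data file rather than recording a benchmark score. A minimal sketch of inspecting the store, assuming it is opened from the repository root:

```python
# Sketch only: open the benchmark shelf and list its entries. shelve.open()
# takes the common filename prefix; dbm.dumb locates the .dat/.dir files.
import shelve

with shelve.open("benchmarks.shelve", flag="r") as db:
    for key, value in db.items():
        # Keys are sweep log-dir paths; the value types are not visible in
        # this diff, so just show a truncated repr.
        print(key, "->", repr(value)[:80])
```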
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d78b57ac043ee94d05e8c1ba184e929678593bf39dee76cc173adacd4357a137
+size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:950a2485764d9a8707289ae5e36dcd0f106bad33b5437d5e88753778f1282ab5
+size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:46f4f0f49ae412d50e473e16ee9ba0d9c9ffba01a96132b9da302e8ed89e83ba
+size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:61f8d3c58bc2c445f6add695c17231a1d6aa44f075e314f683f07998d6e7603b
+size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333565.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3950b20d235aab15fd629f63779e509d8fa68a67d64198ccde410d368bab2fa5
+size 529
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:070cb12f348bade560a253ed036f705331430f20e2c74309b499f372eb402607
+size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9c61cd09949915a47af4cef46db34250f1ba2e1f1a56dc7b5fa1cc44f21a1eb0
+size 562
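
Each `events.out.tfevents.*` file added above is a Git LFS pointer: the three lines (spec version, sha256 `oid`, byte `size`) stand in for the actual TensorBoard event log, which lives in LFS storage. Below is a sketch of fetching one log and listing its scalar tags; the `repo_id` is a guess inferred from the shelve keys, not confirmed by this commit.

```python
# Sketch only: download one LFS-backed TensorBoard log and list its scalars.
# repo_id is an assumption based on the 'distily_smollm_dataset_sweep' prefix
# in the shelve keys; substitute the actual repository id.
from huggingface_hub import hf_hub_download
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

path = hf_hub_download(
    repo_id="distily/distily_smollm_dataset_sweep",  # assumed
    filename=(
        "logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, "
        "dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, "
        "learning_rate=6e-05, per_device_train_batch_size=8/"
        "events.out.tfevents.1727333857.1c1a426a2fee"
    ),
)

events = EventAccumulator(path)
events.Reload()                  # parse the event file
print(events.Tags()["scalars"])  # scalar series logged during the sweep
```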
tokenizer.json CHANGED
@@ -1,19 +1,7 @@
 {
   "version": "1.0",
-  "truncation": {
-    "direction": "Right",
-    "max_length": 1023,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
-  "padding": {
-    "strategy": "BatchLongest",
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 0,
-    "pad_type_id": 0,
-    "pad_token": "<|endoftext|>"
-  },
+  "truncation": null,
+  "padding": null,
   "added_tokens": [
     {
       "id": 0,