lapp0 committed on
Commit 78196d3
1 Parent(s): 8091005

End of training

Files changed (13)
  1. README.md +44 -21
  2. benchmarks.shelve.bak +1 -0
  3. benchmarks.shelve.dat +0 -0
  4. benchmarks.shelve.dir +1 -0
  5. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
  6. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
  7. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
  8. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
  9. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
  10. logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
  11. logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727459787.1c1a426a2fee +3 -0
  12. logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee +3 -0
  13. tokenizer.json +2 -14
README.md CHANGED
@@ -1,7 +1,7 @@
---
base_model: HuggingFaceTB/SmolLM-135M
datasets:
- - HuggingFaceFW/fineweb-edu
+ - wikimedia/wikipedia
library_name: Distily
license: creativeml-openrail-m
tags:
@@ -18,7 +18,7 @@ model-index:

Distilled with [Distily](https://github.com/lapp0/distily) library
using teacher model [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M)
- on dataset [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
+ on dataset [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia).

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment.
@@ -81,20 +81,21 @@ LlamaForCausalLM(
- student 3: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8`
- student 4: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8`
- student 5: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8`
+ - student 6: `dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8`

- | Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 | student 5 |
- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
- | tinyArc.acc_norm,none | 0.37 | 0.303 | 0.295 | 0.302 | 0.26 | 0.269 | **0.319** |
- | tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 | 0.012 |
- | tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
- | tinyHellaswag.acc_norm,none | 0.452 | **0.341** | 0.281 | 0.327 | 0.3 | 0.303 | 0.301 |
- | tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | **0.31** | 0.286 | 0.279 | 0.292 |
- | tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 | 0.427 |
- | tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 | 0.417 |
+ | Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 | student 5 | student 6 |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | tinyArc.acc_norm,none | 0.37 | 0.303 | 0.295 | 0.302 | 0.26 | 0.269 | **0.319** | 0.286 |
+ | tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 | 0.012 | 0.012 |
+ | tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
+ | tinyHellaswag.acc_norm,none | 0.452 | 0.341 | 0.281 | 0.327 | 0.3 | 0.303 | 0.301 | **0.364** |
+ | tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | **0.31** | 0.286 | 0.279 | 0.292 | 0.295 |
+ | tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 | 0.427 | 0.44 |
+ | tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 | 0.417 | 0.439 |

# Resource Usage

- - Max Train VRAM Use: 13.1273 GB
+ - Max Train VRAM Use: 13.1269 GB
- Available VRAM: 23.4329 GB
- GPUs:
- 1x NVIDIA GeForce RTX 4090
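The tiny* rows in the table above are tinyBenchmarks tasks from EleutherAI's lm-evaluation-harness. A minimal sketch of reproducing one column via its Python API; the checkpoint path is a placeholder, not a path taken from this commit:

```python
# Sketch (hedged): score a distilled checkpoint on the tinyBenchmarks tasks
# reported in the metric table above, using lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./distilled_student",  # placeholder path
    tasks=["tinyArc", "tinyGSM8k", "tinyHellaswag",
           "tinyMMLU", "tinyTruthfulQA", "tinyWinogrande"],
)
for task, metrics in results["results"].items():
    print(task, metrics)
```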
 
@@ -124,6 +125,28 @@ LlamaForCausalLM(
(self_attn): LlamaSdpaAttention(
(q_proj): Linear(in_features=576, out_features=576, bias=False)
(k_proj): Linear(in_features=576, out_features=192, bias=False)
+ @@ -10,17 +10,16 @@
+ (o_proj): Linear(in_features=576, out_features=576, bias=False)
+ (rotary_emb): LlamaRotaryEmbedding()
+ )
+ - (mlp): LlamaMLP(
+ + (mlp): LigerSwiGLUMLP(
+ (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
+ (up_proj): Linear(in_features=576, out_features=1536, bias=False)
+ (down_proj): Linear(in_features=1536, out_features=576, bias=False)
+ - (act_fn): SiLU()
+ )
+ - (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+ - (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
+ + (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+ + (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+ )
+ )
+ - (norm): LlamaRMSNorm((576,), eps=1e-05)
+ + (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
+ (rotary_emb): LlamaRotaryEmbedding()
+ )
+ (lm_head): Linear(in_features=576, out_features=49152, bias=False)

```
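The added block is itself a nested diff: the student's `LlamaMLP` and `LlamaRMSNorm` modules are swapped for `LigerSwiGLUMLP` and `LigerRMSNorm`, which are the module names Liger Kernel's Llama patch installs. A sketch of applying that patch, assuming the `liger-kernel` package's documented entry point:

```python
# Sketch (assumes the liger-kernel package): patch transformers' Llama
# implementation so MLP and RMSNorm use Liger's fused kernels, which is how
# LigerSwiGLUMLP / LigerRMSNorm appear in the printed architecture above.
from liger_kernel.transformers import apply_liger_kernel_to_llama
from transformers import AutoModelForCausalLM

apply_liger_kernel_to_llama(swiglu=True, rms_norm=True)  # patch before loading
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
print(model)  # MLP and norm layers now print as Liger* modules
```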
@@ -131,10 +154,10 @@ LlamaForCausalLM(
<br/>

# Train Dataset
- Trained on 640,425,804 tokens from the [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset.
+ Trained on 1,857,293,914 tokens from the [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) dataset.

- - Num Samples: `998,000`
- - Subset: `sample-10BT`
+ - Num Samples: `3,992,000`
+ - Subset: `20231101.en`
- Split: `train`
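For reference, a sketch of loading the same corpus slice with the `datasets` library; the sample-then-split arithmetic (4,000,000 × (1 − 0.002) = 3,992,000) matches the `Num Samples` figure above:

```python
# Sketch: load the training corpus named above. "20231101.en" is the English
# slice of the 2023-11-01 Wikipedia dump mirrored at wikimedia/wikipedia.
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
sample = ds.select(range(4_000_000))   # dataset_sample_size=4000000
# dataset_test_size=0.002 -> 4,000,000 * 0.998 = 3,992,000 train samples
print(sample[0]["text"][:200])         # dataset_column_name="text"
```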
@@ -161,7 +184,7 @@ The following hyperparameters were used during training:
<details>
<summary>Expand</summary>

- - learning_rate: `6e-05`
+ - learning_rate: `0.0001`
- train_batch_size: `8`
- eval_batch_size: `4`
- seed: `42`
@@ -181,7 +204,7 @@ The following hyperparameters were used during training:
weight=0
)
)`
- - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7d824cbaf4f0>`
+ - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x766de39d92d0>`
- student_model_name_or_path: `None`
- student_config_name_or_path: `None`
- student_model_config: `{'num_hidden_layers': 15}`
@@ -192,11 +215,11 @@ The following hyperparameters were used during training:
- teacher_model_name_or_path: `HuggingFaceTB/SmolLM-135M`
- teacher_load_in_8bit: `False`
- teacher_load_in_4bit: `False`
- - dataset_uri: `HuggingFaceFW/fineweb-edu`
- - dataset_subset: `sample-10BT`
+ - dataset_uri: `wikimedia/wikipedia`
+ - dataset_subset: `20231101.en`
- dataset_split: `train`
- dataset_column_name: `text`
- - dataset_sample_size: `1000000`
+ - dataset_sample_size: `4000000`
- dataset_max_seq_length: `1024`
- dataset_test_size: `0.002`
- dataset_shuffle: `False`
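The hyperparameters above appear to come from a Hugging Face Trainer run (the model card notes it was generated from "information the Trainer had access to"). A sketch of how the core values would map onto `transformers.TrainingArguments`; Distily's own trainer wrapper is not shown in this diff, and the output directory name is only inferred from the log paths:

```python
# Sketch (hedged): the listed hyperparameters expressed as TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="distily_smollm_dataset_sweep",  # name inferred from log paths
    learning_rate=1e-4,              # learning_rate: 0.0001
    per_device_train_batch_size=8,   # train_batch_size: 8
    per_device_eval_batch_size=4,    # eval_batch_size: 4
    seed=42,                         # seed: 42
)
```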
benchmarks.shelve.bak CHANGED
@@ -5,3 +5,4 @@
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
+ 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (3584, 448)
benchmarks.shelve.dat CHANGED
Binary files a/benchmarks.shelve.dat and b/benchmarks.shelve.dat differ
 
benchmarks.shelve.dir CHANGED
@@ -5,3 +5,4 @@
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
+ 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8', (3584, 448)
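The `benchmarks.shelve.{bak,dat,dir}` triple is the on-disk layout of Python's `shelve` module over the `dbm.dumb` backend: each line in the `.dir`/`.bak` index maps a run key to a `(position, size)` span inside the binary `.dat` file. A sketch of reading the shelf back:

```python
# Sketch: read the benchmark shelf committed above. shelve over dbm.dumb
# produces the .bak/.dat/.dir triple; values are pickled per-run results.
import shelve

with shelve.open("benchmarks.shelve") as db:
    for run_name in db:
        print(run_name, db[run_name])
```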
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ecb25697ef7473b7b8b5667c93f6875e0df7d2d8d51da0be13197fd00326eb29
+ size 562
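Each added `events.out.tfevents.*` file is committed as a Git LFS pointer, which is exactly the three `+` lines above (spec version, content oid, byte size); the same applies to the remaining log files below. A sketch of parsing one pointer:

```python
# Sketch: parse a Git LFS pointer file such as the one added above. The real
# TensorBoard event data lives in LFS storage, addressed by the sha256 oid.
def read_lfs_pointer(path):
    fields = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            fields[key] = value
    return fields

# e.g. read_lfs_pointer("logs/<run_name>/events.out.tfevents.1727460185.1c1a426a2fee")
# -> {"version": "https://git-lfs.github.com/spec/v1",
#     "oid": "sha256:ecb25697...", "size": "562"}
```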
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c2fbbc999fdcdf1468846bc314f79a1c8adccdf843e07328638a3735b3a093cd
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0ca1b09a384e70ad76ad81e675af99c09aaa0191fd62338bf082102f390bce91
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ad1ce823ddaf3c3e06a9a6b7c1dbf7ec6d4f98433cc0de7b677cddc18e3ad5e2
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8b8e4a1d6b2bc49849d6254e186ae8f0645b868c943c576edebfd67c9efdd853
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6f626ce7e1138affbdf03f1c7c32a4f9935b13ee7d190d3c277b04761c26625b
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727459787.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9db57ae95333d43e007d253ff414f8364bd7568da1de63554ee5e1dcbd09f338
+ size 529
logs/dataset_max_seq_length=1024, dataset_sample_size=4000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727460185.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a065d3f14191df20143698661ca8115aecd472b34caf6ef7848f74d8000a707f
+ size 562
tokenizer.json CHANGED
@@ -1,19 +1,7 @@
{
"version": "1.0",
- "truncation": {
- "direction": "Right",
- "max_length": 1023,
- "strategy": "LongestFirst",
- "stride": 0
- },
- "padding": {
- "strategy": "BatchLongest",
- "direction": "Right",
- "pad_to_multiple_of": null,
- "pad_id": 0,
- "pad_type_id": 0,
- "pad_token": "<|endoftext|>"
- },
+ "truncation": null,
+ "padding": null,
"added_tokens": [
{
"id": 0,