lapp0 committed
Commit a07e465
1 Parent(s): fb2f4e5

End of training

README.md CHANGED
@@ -75,15 +75,21 @@ LlamaForCausalLM(
 
 # Benchmark Metrics Comparison
 
- | Metric | distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8 | distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8 | distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8 | distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8 | logs/teacher |
- | :--- | :--- | :--- | :--- | :--- | :--- |
- | tinyArc.acc_norm,none | 0.303 | 0.295 | 0.26 | 0.302 | 0.37 |
- | tinyGSM8k.exact_match,flexible-extract | 0.029 | 0.03 | 0.006 | 0.025 | 0.006 |
- | tinyGSM8k.exact_match,strict-match | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 |
- | tinyHellaswag.acc_norm,none | 0.341 | 0.281 | 0.3 | 0.327 | 0.452 |
- | tinyMMLU.acc_norm,none | 0.276 | 0.281 | 0.286 | 0.31 | 0.341 |
- | tinyTruthfulQA.acc,none | 0.463 | 0.447 | 0.419 | 0.423 | 0.38 |
- | tinyWinogrande.acc_norm,none | 0.466 | 0.436 | 0.492 | 0.46 | 0.509 |
+ - student 0: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8`
+ - student 1: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8`
+ - student 2: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8`
+ - student 3: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8`
+ - student 4: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8`
+
+ | Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | tinyArc.acc_norm,none | 0.37 | **0.303** | 0.295 | 0.302 | 0.26 | 0.269 |
+ | tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 |
+ | tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
+ | tinyHellaswag.acc_norm,none | 0.452 | **0.341** | 0.281 | 0.327 | 0.3 | 0.303 |
+ | tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | **0.31** | 0.286 | 0.279 |
+ | tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 |
+ | tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 |
 
 # Resource Usage
 
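For orientation, the tiny* metrics above come from the tinyBenchmarks tasks in lm-evaluation-harness. A minimal sketch of reproducing such scores, assuming `lm_eval` is installed; the checkpoint path is an illustrative assumption, not taken from this commit:

```python
import lm_eval

# Hedged sketch: evaluate a student checkpoint on the tiny* tasks in the table.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./distily_smollm_dataset_sweep",  # assumed local path
    tasks=["tinyArc", "tinyGSM8k", "tinyHellaswag",
           "tinyMMLU", "tinyTruthfulQA", "tinyWinogrande"],
)
print(results["results"])  # per-task acc_norm / exact_match values
```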
@@ -146,7 +152,7 @@ LlamaForCausalLM(
 <br/>
 
 # Train Dataset
- Trained on 501,164,413 tokens from the [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset.
+ Trained on 501,158,307 tokens from the [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset.
 
 - Num Samples: `998,000`
 - Subset: `sample-10BT`
@@ -176,7 +182,7 @@ The following hyperparameters were used during training:
 <details>
 <summary>Expand</summary>
 
- - learning_rate: `0.0001`
+ - learning_rate: `6e-05`
 - train_batch_size: `8`
 - eval_batch_size: `4`
 - seed: `42`
@@ -196,7 +202,7 @@ The following hyperparameters were used during training:
 weight=0
 )
 )`
- - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7205cc5db070>`
+ - lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7d820438ae60>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `{'num_hidden_layers': 15}`
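The `lr_scheduler` entry above serializes only an object repr. For context, a minimal sketch of how a `LambdaLR` typically arises in this stack via `transformers`; the warmup and step counts are assumptions, not values from this commit:

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in for the student model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-05)

# get_linear_schedule_with_warmup returns a torch.optim.lr_scheduler.LambdaLR,
# matching the bare repr in the README. Step counts here are illustrative.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=124_750
)

optimizer.step()
scheduler.step()  # advance the schedule once per optimizer step
```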
benchmarks.shelve.bak CHANGED
@@ -3,3 +3,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8', (1024, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
+ 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
benchmarks.shelve.dat CHANGED
Binary files a/benchmarks.shelve.dat and b/benchmarks.shelve.dat differ
 
benchmarks.shelve.dir CHANGED
@@ -3,3 +3,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8', (1024, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
+ 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
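The `.bak`/`.dat`/`.dir` triple is what Python's `shelve` writes when it uses the `dbm.dumb` backend; the `.dir` file holds the `key, (offset, size)` index shown in this diff. A minimal sketch of adding an entry like the one above; the stored value is illustrative, since only the key and index are visible here:

```python
import shelve

# Creates/updates benchmarks.shelve.dat, benchmarks.shelve.dir (the
# key -> (offset, size) index shown above), and benchmarks.shelve.bak.
with shelve.open("benchmarks.shelve") as db:
    key = ("distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, "
           "dataset_sample_size=1000000, dataset_subset=sample-10BT, "
           "dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, "
           "per_device_train_batch_size=8")
    db[key] = {"tinyArc.acc_norm,none": 0.269}  # illustrative payload
```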
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3b5a710dcc579a86b0e77d95e438a45193434f75abd10440cbe8be03de4d0ead
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cec2d93495bb721960016bb9c4c2e8d682079af90b5cbc7545c1c7d6e51bfd17
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727305455.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:14ee0b874c857320d8420d2ae889db321652e9b56742a486e22cd93f52b8e5de
+ size 529
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6a8bbff5b04c03072b196adf81878e6c99cb4918f73f2ac8de533de7d7040018
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ae5000114739e74ad5a1a1f2290ec572ae8baebb1eb21e65344cdf8bcf4d3e11
+ size 562
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d8674cd466ef5e0e588df023720ce9449d34eabfcffe79520897b6af7318fbb1
+ size 562
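Each addition above is a Git LFS pointer (version line, sha256 oid, byte size) standing in for a TensorBoard event file. Once the real files are fetched (e.g. with `git lfs pull`), one way to inspect them is TensorBoard's `EventAccumulator`; the path below is illustrative:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Illustrative path; the real run directories use the long "dataset_..." names.
acc = EventAccumulator("logs/run/events.out.tfevents.1727305844.1c1a426a2fee")
acc.Reload()  # parse the event file from disk

for tag in acc.Tags()["scalars"]:      # scalar series logged during training
    for event in acc.Scalars(tag):
        print(tag, event.step, event.value)
```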
tokenizer.json CHANGED
@@ -1,19 +1,7 @@
 {
   "version": "1.0",
-   "truncation": {
-     "direction": "Right",
-     "max_length": 1023,
-     "strategy": "LongestFirst",
-     "stride": 0
-   },
-   "padding": {
-     "strategy": "BatchLongest",
-     "direction": "Right",
-     "pad_to_multiple_of": null,
-     "pad_id": 0,
-     "pad_type_id": 0,
-     "pad_token": "<|endoftext|>"
-   },
+   "truncation": null,
+   "padding": null,
   "added_tokens": [
     {
       "id": 0,