lapp0 committed
Commit 06e2b60
1 Parent(s): b40b099

End of training

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
  The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
 
  It achieves the following results on the evaluation set:
- - eval_enwikippl: 132.5935
- - eval_frwikippl: 19405.3008
- - eval_zhwikippl: 53229.7070
- - eval_tinystoriesppl: 9.1860
- - eval_loss: 1.2126
- - eval_runtime: 13.0629
- - eval_samples_per_second: 76.553
- - eval_steps_per_second: 9.569
+ - eval_enwikippl: 108.1245
+ - eval_frwikippl: 11043.4336
+ - eval_zhwikippl: 55788.7734
+ - eval_tinystoriesppl: 6.7037
+ - eval_loss: 0.7047
+ - eval_runtime: 13.0964
+ - eval_samples_per_second: 76.357
+ - eval_steps_per_second: 9.545
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment.
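The eval_*ppl metrics above are perplexities on English Wikipedia (enwikippl), French Wikipedia (frwikippl), Chinese Wikipedia (zhwikippl), and TinyStories (tinystoriesppl) evaluation text. As a minimal sketch of how such numbers are typically computed for a causal LM with Hugging Face transformers (illustrative only, not Distily's actual evaluation code; the checkpoint path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path; substitute the distilled student checkpoint.
model = AutoModelForCausalLM.from_pretrained("path/to/student-model")
tokenizer = AutoTokenizer.from_pretrained("path/to/student-model")

def perplexity(text: str) -> float:
    """Perplexity = exp(mean next-token cross-entropy)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Hugging Face causal LMs shift labels internally, so passing
        # labels=input_ids yields the mean next-token prediction loss.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()
```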
@@ -47,7 +47,7 @@ More information needed
  The following hyperparameters were used during training:
  - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  - train_embeddings: True
- - learning_rate: 4e-06
+ - learning_rate: 1e-05
  - train_batch_size: 1
  - eval_batch_size: 8
  - seed: 42
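In the distillation_objective above, only the logits component is active: loss_fn=kl with weight 1, while the hidden-state (hs) and attention (attn) components carry weight 0. Below is a minimal sketch of such a logits-level KL-divergence loss in PyTorch; the function name, the flattening, and the optional temperature are illustrative assumptions, not Distily's internals:

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    # Hypothetical helper: KL(teacher || student) over the vocabulary.
    # Flatten (batch, seq, vocab) -> (batch * seq, vocab) so "batchmean"
    # averages the summed KL over every token position.
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, -2)
    t = F.log_softmax(teacher_logits / temperature, dim=-1).flatten(0, -2)
    # log_target=True: the target distribution is given as log-probabilities.
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2
```

With temperature=1.0 this reduces to plain KL divergence between the teacher's and student's next-token distributions, which matches the loss_fn=kl label in the objective.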
@@ -62,27 +62,27 @@ Peak GPU Memory: 6.6064 GB
  | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 50480.5703 | 85684.4844 | 6.8305 | 13.0395 | 76.69 | 9.586 | 33932.0586 | 94692.1562 |
- | 5000 | 0.0505 | 132.2037 | 19438.1211 | 1.2135 | 13.0309 | 76.741 | 9.593 | 9.1868 | 53144.5430 |
- | 10000 | 0.1010 | 132.4087 | 19372.5156 | 1.2127 | 13.048 | 76.64 | 9.58 | 9.1830 | 53201.2852 |
- | 15000 | 0.1515 | 132.5935 | 19416.2402 | 1.2128 | 13.0596 | 76.572 | 9.571 | 9.1887 | 53144.5430 |
- | 20000 | 0.2020 | 132.4292 | 19367.0664 | 1.2127 | 13.037 | 76.705 | 9.588 | 9.1955 | 53343.4375 |
- | 25000 | 0.2525 | 132.5113 | 19367.0664 | 1.2126 | 13.0647 | 76.542 | 9.568 | 9.1890 | 53258.0898 |
- | 30000 | 0.3030 | 132.5935 | 19383.4375 | 1.2125 | 13.0105 | 76.861 | 9.608 | 9.1845 | 53201.2852 |
- | 35000 | 0.3535 | 132.3267 | 19372.5156 | 1.2127 | 13.1134 | 76.258 | 9.532 | 9.1754 | 53229.7070 |
- | 40000 | 0.4040 | 132.5935 | 19367.0664 | 1.2127 | 13.0356 | 76.713 | 9.589 | 9.1928 | 53229.7070 |
- | 45000 | 0.4545 | 132.4908 | 19372.5156 | 1.2126 | 13.0611 | 76.563 | 9.57 | 9.1826 | 53258.0898 |
- | 50000 | 0.5051 | 132.2447 | 19405.3008 | 1.2126 | 13.07 | 76.511 | 9.564 | 9.1803 | 53286.5391 |
- | 55000 | 0.5556 | 132.6346 | 19405.3008 | 1.2126 | 13.0134 | 76.844 | 9.605 | 9.1917 | 53229.7070 |
- | 60000 | 0.6061 | 132.6346 | 19405.3008 | 1.2126 | 13.0453 | 76.656 | 9.582 | 9.1883 | 53258.0898 |
- | 65000 | 0.6566 | 132.6346 | 19394.3652 | 1.2126 | 13.0475 | 76.643 | 9.58 | 9.1928 | 53258.0898 |
- | 70000 | 0.7071 | 132.5935 | 19427.1680 | 1.2125 | 13.0602 | 76.568 | 9.571 | 9.1830 | 53229.7070 |
- | 75000 | 0.7576 | 132.4292 | 19405.3008 | 1.2126 | 13.0658 | 76.535 | 9.567 | 9.1788 | 53229.7070 |
- | 80000 | 0.8081 | 132.6346 | 19405.3008 | 1.2127 | 13.0497 | 76.63 | 9.579 | 9.1871 | 53229.7070 |
- | 85000 | 0.8586 | 132.5935 | 19405.3008 | 1.2126 | 13.0439 | 76.664 | 9.583 | 9.1879 | 53229.7070 |
- | 90000 | 0.9091 | 132.5935 | 19405.3008 | 1.2126 | 13.0368 | 76.706 | 9.588 | 9.1814 | 53229.7070 |
- | 95000 | 0.9596 | 132.5935 | 19405.3008 | 1.2126 | 13.0326 | 76.731 | 9.591 | 9.1868 | 53229.7070 |
- | 99000 | 1.0 | 132.5935 | 19405.3008 | 1.2126 | 13.0629 | 76.553 | 9.569 | 9.1860 | 53229.7070 |
+ | 0 | 0 | 50480.5703 | 85684.4844 | 6.8305 | 13.0511 | 76.622 | 9.578 | 33932.0586 | 94692.1562 |
+ | 5000 | 0.0505 | 107.6398 | 10878.2363 | 0.7372 | 13.0368 | 76.706 | 9.588 | 6.6259 | 48821.6875 |
+ | 10000 | 0.1010 | 103.7832 | 10693.6533 | 0.7210 | 13.0383 | 76.697 | 9.587 | 6.3876 | 52904.0898 |
+ | 15000 | 0.1515 | 113.9463 | 10959.7607 | 0.7146 | 13.0455 | 76.655 | 9.582 | 7.3001 | 55833.4297 |
+ | 20000 | 0.2020 | 102.8906 | 10842.2969 | 0.7117 | 13.0448 | 76.659 | 9.582 | 6.3362 | 55967.6680 |
+ | 25000 | 0.2525 | 107.6648 | 11021.6855 | 0.7063 | 13.0457 | 76.654 | 9.582 | 6.7065 | 55654.9688 |
+ | 30000 | 0.3030 | 107.8986 | 11027.8887 | 0.7052 | 13.0423 | 76.673 | 9.584 | 6.6954 | 55122.9922 |
+ | 35000 | 0.3535 | 107.8986 | 10953.5859 | 0.7051 | 12.9974 | 76.939 | 9.617 | 6.6910 | 54771.1680 |
+ | 40000 | 0.4040 | 107.9989 | 10941.2451 | 0.7053 | 13.0736 | 76.49 | 9.561 | 6.7123 | 55122.9922 |
+ | 45000 | 0.4545 | 107.8317 | 10986.0273 | 0.7051 | 13.0495 | 76.632 | 9.579 | 6.7056 | 55064.1953 |
+ | 50000 | 0.5051 | 107.9905 | 11037.2217 | 0.7049 | 13.0288 | 76.753 | 9.594 | 6.7202 | 55922.9062 |
+ | 55000 | 0.5556 | 108.2753 | 10973.6602 | 0.7051 | 13.0751 | 76.481 | 9.56 | 6.7202 | 54917.4609 |
+ | 60000 | 0.6061 | 108.0324 | 11037.2217 | 0.7052 | 13.0104 | 76.861 | 9.608 | 6.7093 | 55358.7930 |
+ | 65000 | 0.6566 | 108.3089 | 11043.4336 | 0.7049 | 13.0425 | 76.673 | 9.584 | 6.7123 | 55122.9922 |
+ | 70000 | 0.7071 | 108.2418 | 11043.4336 | 0.7047 | 12.9968 | 76.942 | 9.618 | 6.7065 | 55122.9922 |
+ | 75000 | 0.7576 | 107.9069 | 11043.4336 | 0.7046 | 13.0103 | 76.862 | 9.608 | 6.7004 | 55506.7109 |
+ | 80000 | 0.8081 | 108.1915 | 11043.4336 | 0.7047 | 13.0166 | 76.825 | 9.603 | 6.6979 | 55788.7734 |
+ | 85000 | 0.8586 | 108.3089 | 11043.4336 | 0.7045 | 13.0625 | 76.555 | 9.569 | 6.7076 | 55759.0430 |
+ | 90000 | 0.9091 | 108.2083 | 11043.4336 | 0.7047 | 13.0397 | 76.689 | 9.586 | 6.7059 | 55788.7734 |
+ | 95000 | 0.9596 | 108.1999 | 11043.4336 | 0.7045 | 13.0487 | 76.636 | 9.579 | 6.7062 | 55788.7734 |
+ | 99000 | 1.0 | 108.1245 | 11043.4336 | 0.7047 | 13.0964 | 76.357 | 9.545 | 6.7037 | 55788.7734 |
 
  ### Framework versions
  - Distily 0.2.0
 
logs/copy_teacher_modules=_(_lm_head___False)_, learning_rate=1e-05/events.out.tfevents.1724034905.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2eea5a1a055cbca68c599c9cf8f94a25640f16b99ecae80ab5647818f78fc18e
+ size 312