---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.11_gpt2
  results: []
---

# distily_bench_obj_cross_v2.11_gpt2

This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2) using the dataset (unspecified).

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

It achieves the following results on the evaluation set:
- eval_enwikippl: 840.1149
- eval_frwikippl: 528.4605
- eval_zhwikippl: 126.6330
- eval_tinystoriesppl: 1037.4924
- eval_loss: 0.5100
- eval_runtime: 21.5094
- eval_samples_per_second: 46.491
- eval_steps_per_second: 11.623

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 1
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.0

### Resource Usage

Peak GPU Memory: 3.9285 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 270.2348 | 76.8142 | | | | | 671.1238 | 22.8030 |
| 0 | 0 | 120078.375 | 1867851235328.0 | 19.4492 | 21.0652 | 47.472 | 11.868 | 72.8770 | 4013754155008.0 |
| 5000 | 0.0505 | 1216.0441 | 888.1107 | 0.7144 | 21.4135 | 46.7 | 11.675 | 1267.6812 | 332.8297 |
| 10000 | 0.1010 | 1162.2788 | 799.4963 | 0.6619 | 21.4269 | 46.67 | 11.668 | 1249.7319 | 438.5025 |
| 15000 | 0.1515 | 980.3101 | 668.6794 | 0.6395 | 21.4739 | 46.568 | 11.642 | 1056.4025 | 425.3380 |
| 20000 | 0.2020 | 1064.2865 | 759.8051 | 0.6318 | 21.4643 | 46.589 | 11.647 | 1151.2905 | 311.5830 |
| 25000 | 0.2525 | 916.0289 | 621.8902 | 0.5662 | 21.1368 | 47.311 | 11.828 | 1071.6635 | 190.3806 |
| 30000 | 0.3030 | 891.1293 | 582.2575 | 0.5445 | 21.4338 | 46.655 | 11.664 | 1072.1951 | 208.7082 |
| 35000 | 0.3535 | 886.6196 | 544.0957 | 0.5381 | 21.5335 | 46.439 | 11.61 | 1057.8008 | 142.8915 |
| 40000 | 0.4040 | 880.1868 | 549.4098 | 0.5349 | 21.4687 | 46.58 | 11.645 | 1076.1021 | 142.8439 |
| 45000 | 0.4545 | 868.9573 | 564.4311 | 0.5323 | 21.4349 | 46.653 | 11.663 | 1042.4788 | 161.4311 |
| 50000 | 0.5051 | 877.1919 | 541.3246 | 0.5320 | 21.548 | 46.408 | 11.602 | 1058.0631 | 167.7873 |
| 55000 | 0.5556 | 869.4625 | 543.6743 | 0.5313 | 21.4821 | 46.55 | 11.638 | 1043.7725 | 163.6863 |
| 60000 | 0.6061 | 872.2788 | 553.3121 | 0.5305 | 21.4316 | 46.66 | 11.665 | 1068.5228 | 141.9700 |
| 65000 | 0.6566 | 833.5512 | 524.0497 | 0.5156 | 21.1637 | 47.251 | 11.813 | 1028.6963 | 137.2677 |
| 70000 | 0.7071 | 837.5645 | 523.4596 | 0.5133 | 21.4101 | 46.707 | 11.677 | 1031.1652 | 124.3812 |
| 75000 | 0.7576 | 847.7309 | 523.0175 | 0.5129 | 21.1745 | 47.227 | 11.807 | 1047.8357 | 130.6221 |
| 80000 | 0.8081 | 843.6693 | 534.2609 | 0.5125 | 21.388 | 46.755 | 11.689 | 1040.4556 | 125.4979 |
| 85000 | 0.8586 | 843.2120 | 524.1607 | 0.5106 | 21.4851 | 46.544 | 11.636 | 1042.5220 | 126.1609 |
| 90000 | 0.9091 | 842.1672 | 529.2425 | 0.5101 | 21.4494 | 46.621 | 11.655 | 1040.6277 | 126.7345 |
| 95000 | 0.9596 | 838.0835 | 528.3859 | 0.5099 | 21.1216 | 47.345 | 11.836 | 1034.5377 | 126.5655 |
| 99000 | 1.0 | 840.1149 | 528.4605 | 0.5100 | 21.5094 | 46.491 | 11.623 | 1037.4924 | 126.6330 |

### Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0
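
The `distillation_objective` listed under the training hyperparameters gives the logits component weight 1 with `loss_fn=kl`, and the hidden-state and attention components weight 0, so the reported `loss` is presumably a KL divergence between teacher and student output distributions. The following is a minimal pure-Python sketch of such a loss, not Distily's implementation. The function names are hypothetical, and the direction KL(teacher || student), the absence of a temperature, and the mean reduction over positions are all assumptions that this card does not confirm.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position, in nats.

    Assumption: the teacher distribution is the reference, which is the
    common choice in logit distillation but is not stated in this card.
    """
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


def mean_kl_logits_loss(teacher_batch, student_batch):
    """Average the per-position KL over a batch of positions (assumed reduction)."""
    losses = [kl_logits_loss(t, s) for t, s in zip(teacher_batch, student_batch)]
    return sum(losses) / len(losses)
```

Under this reading, a loss of 0 would mean the student reproduces the teacher's next-token distribution exactly. Because the loss measures agreement with the teacher and not with the ground-truth tokens, it is not expected to equal the logarithm of any of the perplexity columns.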