metadata

base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.11_gpt2
    results: []

distily_bench_obj_cross_v2.11_gpt2

This student model was distilled from the teacher model gpt2. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

eval_enwikippl: 840.1149
eval_frwikippl: 528.4605
eval_zhwikippl: 126.6330
eval_tinystoriesppl: 1037.4924
eval_loss: 0.5100
eval_runtime: 21.5094
eval_samples_per_second: 46.491
eval_steps_per_second: 11.623
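
As a rough sanity check, the throughput figures above can be cross-checked against each other. The sketch below uses only the reported values; the variable names are illustrative, and the conclusion that the evaluation set holds about 1000 samples is inferred from the arithmetic, not stated in the card.

```python
# Values copied from the evaluation results listed above.
eval_runtime = 21.5094             # seconds
eval_samples_per_second = 46.491
eval_steps_per_second = 11.623

# Runtime multiplied by throughput gives the implied totals.
n_samples = round(eval_runtime * eval_samples_per_second)
n_steps = round(eval_runtime * eval_steps_per_second)

# Samples per step should match the eval_batch_size listed
# under the training hyperparameters.
samples_per_step = n_samples / n_steps

print(n_samples, n_steps, samples_per_step)
```

The implied samples per step agrees with the eval_batch_size of 4 reported in the hyperparameters.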

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 4e-05
train_batch_size: 1
eval_batch_size: 4
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 1.0
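
The distillation_objective listed above puts all of its weight on the logits component (weight=1, loss_fn=kl) and none on hidden states or attention. A minimal pure-Python sketch of such a loss for a single token position follows. It is not Distily's implementation: the function names are hypothetical, and the KL direction (teacher distribution relative to student distribution) is an assumption based on common distillation practice.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) over one vocabulary distribution.

    Assumption: the teacher is the reference distribution.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def total_loss(teacher_logits, student_logits,
               logits_weight=1.0, hs_weight=0.0, attn_weight=0.0):
    """Weighted sum mirroring the reported component weights.

    With hs_weight and attn_weight at 0, only the logits term
    contributes, so the other components are not computed here.
    """
    loss = logits_weight * kl_logits_loss(teacher_logits, student_logits)
    # The hidden-state and attention terms would be added here
    # if their weights were non-zero.
    return loss

# Toy example with a 3-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
print(total_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two distributions diverge.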

Resource Usage

Peak GPU Memory: 3.9285 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		270.2348	76.8142					671.1238	22.8030
0	0	120078.375	1867851235328.0	19.4492	21.0652	47.472	11.868	72.8770	4013754155008.0
5000	0.0505	1216.0441	888.1107	0.7144	21.4135	46.7	11.675	1267.6812	332.8297
10000	0.1010	1162.2788	799.4963	0.6619	21.4269	46.67	11.668	1249.7319	438.5025
15000	0.1515	980.3101	668.6794	0.6395	21.4739	46.568	11.642	1056.4025	425.3380
20000	0.2020	1064.2865	759.8051	0.6318	21.4643	46.589	11.647	1151.2905	311.5830
25000	0.2525	916.0289	621.8902	0.5662	21.1368	47.311	11.828	1071.6635	190.3806
30000	0.3030	891.1293	582.2575	0.5445	21.4338	46.655	11.664	1072.1951	208.7082
35000	0.3535	886.6196	544.0957	0.5381	21.5335	46.439	11.61	1057.8008	142.8915
40000	0.4040	880.1868	549.4098	0.5349	21.4687	46.58	11.645	1076.1021	142.8439
45000	0.4545	868.9573	564.4311	0.5323	21.4349	46.653	11.663	1042.4788	161.4311
50000	0.5051	877.1919	541.3246	0.5320	21.548	46.408	11.602	1058.0631	167.7873
55000	0.5556	869.4625	543.6743	0.5313	21.4821	46.55	11.638	1043.7725	163.6863
60000	0.6061	872.2788	553.3121	0.5305	21.4316	46.66	11.665	1068.5228	141.9700
65000	0.6566	833.5512	524.0497	0.5156	21.1637	47.251	11.813	1028.6963	137.2677
70000	0.7071	837.5645	523.4596	0.5133	21.4101	46.707	11.677	1031.1652	124.3812
75000	0.7576	847.7309	523.0175	0.5129	21.1745	47.227	11.807	1047.8357	130.6221
80000	0.8081	843.6693	534.2609	0.5125	21.388	46.755	11.689	1040.4556	125.4979
85000	0.8586	843.2120	524.1607	0.5106	21.4851	46.544	11.636	1042.5220	126.1609
90000	0.9091	842.1672	529.2425	0.5101	21.4494	46.621	11.655	1040.6277	126.7345
95000	0.9596	838.0835	528.3859	0.5099	21.1216	47.345	11.836	1034.5377	126.5655
99000	1.0	840.1149	528.4605	0.5100	21.5094	46.491	11.623	1037.4924	126.6330
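
To put the final row in context, the student's perplexities can be expressed as multiples of the teacher's. The sketch below copies the values from the "teacher eval" row and the step 99000 row of the table; the dictionary and variable names are illustrative only.

```python
# Perplexities copied from the "teacher eval" row.
teacher = {
    "enwikippl": 270.2348,
    "frwikippl": 76.8142,
    "tinystoriesppl": 671.1238,
    "zhwikippl": 22.8030,
}

# Perplexities copied from the final student row (step 99000).
student = {
    "enwikippl": 840.1149,
    "frwikippl": 528.4605,
    "tinystoriesppl": 1037.4924,
    "zhwikippl": 126.6330,
}

# A ratio above 1 means the student is worse than the teacher
# on that dataset (lower perplexity is better).
ratios = {name: student[name] / teacher[name] for name in teacher}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.2f}x teacher perplexity")
```

On these numbers the gap to the teacher is smallest on tinystoriesppl and largest on frwikippl.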

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0