distily_bench_gpt2_simple_objectives

This student model was distilled from the teacher model gpt2. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (these values match the step 9000 row of the Eval-Phase Metrics table, not the final step):

eval_enwikippl: 495.9718
eval_frwikippl: 3345.0957
eval_zhwikippl: 2696.0598
eval_loss: 40.5622
eval_runtime: 34.3051
eval_samples_per_second: 58.3
eval_steps_per_second: 7.288
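
The throughput figures above can be cross-checked with a little arithmetic. The sketch below assumes eval_runtime is reported in seconds, which the card does not state explicitly; under that assumption, multiplying each rate by the runtime recovers the approximate number of samples and steps in one evaluation pass.

```python
# Reported evaluation figures, copied from the list above.
eval_runtime_s = 34.3051        # assumed to be in seconds
samples_per_second = 58.3
steps_per_second = 7.288

# Rate multiplied by runtime recovers the approximate totals.
approx_samples = samples_per_second * eval_runtime_s
approx_steps = steps_per_second * eval_runtime_s

print(f"samples evaluated: ~{approx_samples:.0f}")
print(f"eval steps: ~{approx_steps:.0f}")
print(f"samples per step: ~{approx_samples / approx_steps:.1f}")
```

The samples-per-step ratio can be compared against the eval_batch_size listed under the training hyperparameters.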

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: MultiObjective(logits_weight=1, logits_loss_fn=(fn:kl_divergence_loss()), activations_weight=0.2, activations_loss_fn=(fn:mse_loss()), attentions_weight=0, attentions_loss_fn=(fn:mse_loss()))
train_embeddings: True
learning_rate: 4e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: constant
num_epochs: 1.0
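
To make the distillation_objective line concrete, here is a minimal sketch of a weighted sum of a KL-divergence term on logits and an MSE term on activations, using the weights listed above (1 and 0.2). It is written in plain Python for illustration and is not Distily's implementation: the function names, the KL direction (teacher relative to student), and the mean reduction are all assumptions, and the attention term is left out because its weight is 0.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    # KL(teacher || student); the direction is an assumption of this sketch.
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(teacher_values, student_values):
    # Mean squared error between two equally sized activation vectors.
    n = len(teacher_values)
    return sum((t - s) ** 2 for t, s in zip(teacher_values, student_values)) / n

def combined_loss(t_logits, s_logits, t_acts, s_acts,
                  logits_weight=1.0, activations_weight=0.2):
    # Weighted sum of both terms; attentions_weight=0, so no attention term.
    return (logits_weight * kl_divergence(t_logits, s_logits)
            + activations_weight * mse(t_acts, s_acts))

# Toy values for illustration only; they are not taken from the model.
loss = combined_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.5], [0.1, 0.4], [0.0, 0.6])
print(loss)
```

When the student reproduces the teacher exactly, both terms vanish and the combined loss is zero, which is a quick sanity check for any implementation of this kind of objective.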

Resource Usage

Peak GPU Memory: 8.0893 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	zhwikippl
teacher eval		30.2086	57.2728					18.1784
0	0	54069.2930	57285.3438	133.3560	34.3024	58.305	7.288	54227.1016
1000	0.0404	1437.1931	8480.3613	43.3980	34.5438	57.898	7.237	74395.6406
2000	0.0808	997.8790	5260.8130	42.3937	34.3703	58.19	7.274	33120.7461
3000	0.1212	830.5260	5152.1616	41.7965	34.348	58.228	7.278	11334.5342
4000	0.1616	745.0864	4422.4756	41.2595	34.4519	58.052	7.256	5651.1323
5000	0.2020	644.0798	4158.1821	41.1632	34.4631	58.033	7.254	4903.9395
6000	0.2424	592.7726	3791.3215	40.8778	34.3097	58.293	7.287	4353.2559
7000	0.2828	545.3409	3490.1353	40.8020	34.4207	58.105	7.263	3123.4839
8000	0.3232	519.2236	3238.8032	40.6310	34.2625	58.373	7.297	1952.6049
9000	0.3636	495.9718	3345.0957	40.5622	34.3051	58.3	7.288	2696.0598
10000	0.4040	482.7110	3048.2520	40.4688	34.3828	58.169	7.271	2027.5375
11000	0.4444	453.9180	2860.8340	40.3758	34.2822	58.339	7.292	2861.5081
12000	0.4848	441.7129	2985.2966	40.2887	34.2175	58.45	7.306	2510.5007
13000	0.5253	429.0357	2882.7014	40.1765	34.4175	58.11	7.264	6012.3589
14000	0.5657	416.6578	2756.2913	40.1022	34.4762	58.011	7.251	12478.4199
15000	0.6061	406.1163	2797.8003	40.0135	34.5042	57.964	7.246	6068.8252
16000	0.6465	405.6435	2525.5491	39.9328	34.3124	58.288	7.286	4309.2979
17000	0.6869	394.7977	2709.6606	39.9735	34.3165	58.281	7.285	2797.2800
18000	0.7273	397.4739	2544.8535	39.7368	34.4016	58.137	7.267	9888.5605
19000	0.7677	387.6284	2540.5505	39.7513	34.3493	58.225	7.278	5071.7769
20000	0.8081	378.9675	2503.9182	39.6105	34.4198	58.106	7.263	3492.3926
21000	0.8485	376.9130	2442.8845	39.5590	34.343	58.236	7.28	10077.8555
22000	0.8889	374.1136	2348.3101	39.5182	34.2953	58.317	7.29	3595.5537
23000	0.9293	368.7203	2389.3955	39.4282	34.6197	57.771	7.221	11663.1113
24000	0.9697	365.7831	2363.9253	39.4065	34.6468	57.725	7.216	5269.2183
24750	1.0	363.6872	2441.5068	39.3040	34.7181	57.607	7.201	2566.7729
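
Perplexities spanning several orders of magnitude are easier to compare on a log scale. The sketch below converts the enwikippl values from the table into a mean per-token loss, assuming the usual definition of perplexity as the exponential of the mean cross-entropy in nats; the card does not state how these columns were computed, so treat the conversion as an assumption.

```python
import math

# Perplexities copied from the table above (enwikippl column).
teacher_enwikippl = 30.2086
student_enwikippl_start = 54069.2930   # step 0
student_enwikippl_final = 363.6872     # step 24750

def mean_token_loss(perplexity):
    # Assumes perplexity = exp(mean cross-entropy in nats).
    return math.log(perplexity)

teacher_loss = mean_token_loss(teacher_enwikippl)
gap_start = mean_token_loss(student_enwikippl_start) - teacher_loss
gap_final = mean_token_loss(student_enwikippl_final) - teacher_loss

print(f"gap to teacher at step 0: {gap_start:.2f} nats/token")
print(f"gap to teacher at step 24750: {gap_final:.2f} nats/token")
```

Viewed this way, the remaining distance between student and teacher is expressed in the same units as a cross-entropy loss, which makes rows of the table directly comparable.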

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.20.0
