distily_bench_obj_cross_v2.9

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is not specified.

The Distily library was used for this distillation.

Its evaluation results at each logged training step, alongside the teacher's, are reported in the table at the end of this card.

Training procedure

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 4e-05
train_batch_size: 1
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 1.0
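
The distillation_objective above puts all of its weight on the logits component with loss_fn=kl; the hidden-state and attention components have weight 0. As a rough illustration of what a KL loss on logits computes, here is a minimal pure-Python sketch for a single token position. It is not Distily's implementation: the function names are hypothetical, and the direction KL(teacher || student) and a softmax temperature of 1 are assumptions, since the card does not state them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token position.

    Assumption: the teacher distribution is the target and no
    temperature scaling is applied.
    """
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


teacher = [2.0, 0.5, -1.0]

# Identical logits give identical distributions, so the loss is zero.
matched = kl_logits_loss(teacher, teacher)

# Softmax ignores a constant offset, so shifted logits also give zero loss.
shifted = kl_logits_loss(teacher, [x + 3.0 for x in teacher])

# A student that ranks the tokens in the opposite order is penalized.
mismatched = kl_logits_loss(teacher, [-1.0, 0.5, 2.0])

print(matched, shifted, mismatched)
```

In a real training loop this quantity would be computed with tensor operations and averaged over every token position in the batch; the scalar version here only shows the arithmetic.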

Peak GPU Memory: 6.6064 GB

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		169.9865	47377.9414					3.9789	4998.1294
0	0	25306.5312	80342.6562	6.4738	6.54	76.453	9.633	14565.9658	71518.8438
5000	0.1010	104.6652	13772.8643	0.7819	6.5261	76.616	9.654	5.5260	74161.4531
10000	0.2020	144.5553	13569.7109	0.7842	6.5111	76.792	9.676	8.9185	62270.8359
15000	0.3030	105.4526	12598.8818	0.7708	6.5186	76.704	9.665	5.6194	53872.625
20000	0.4040	121.5509	12060.5781	0.7610	6.5194	76.694	9.663	7.1313	52133.4336
25000	0.5051	111.6548	13016.4775	0.7537	6.5166	76.727	9.668	6.0700	53485.9688
30000	0.6061	101.6823	11441.6719	0.7577	6.5294	76.577	9.649	5.7104	48007.9258
35000	0.7071	97.8760	10992.2207	0.7519	6.5151	76.745	9.67	5.5543	47549.0430
40000	0.8081	114.8546	11104.2744	0.7378	6.5634	76.18	9.599	6.9089	42804.5273
45000	0.9091	112.3336	11524.9678	0.7228	6.5648	76.164	9.597	6.6096	46781.4727
49500	1.0	110.1023	10899.7119	0.7097	6.5487	76.351	9.62	6.5616	49450.9141
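
One way to read the table is to compare the final student row (step 49500) against the "teacher eval" row. The sketch below copies the four perplexity values from those two rows and computes student/teacher ratios, where a ratio below 1 means the student's perplexity is lower than the teacher's on that set. It also checks that the epoch column is consistent with step / 49500. The variable names are illustrative only and are not part of Distily.

```python
# Perplexities copied from the "teacher eval" row of the table.
teacher = {
    "enwikippl": 169.9865,
    "frwikippl": 47377.9414,
    "tinystoriesppl": 3.9789,
    "zhwikippl": 4998.1294,
}

# Perplexities copied from the final student row (step 49500).
student = {
    "enwikippl": 110.1023,
    "frwikippl": 10899.7119,
    "tinystoriesppl": 6.5616,
    "zhwikippl": 49450.9141,
}

# Ratio < 1: student perplexity is lower than the teacher's.
ratios = {name: student[name] / teacher[name] for name in teacher}

for name, ratio in ratios.items():
    print(f"{name}: student/teacher = {ratio:.2f}")

# The epoch column matches step / total_steps, with 49500 steps in one epoch.
TOTAL_STEPS = 49500
epoch_at_5000 = round(5000 / TOTAL_STEPS, 4)
epoch_at_25000 = round(25000 / TOTAL_STEPS, 4)
print(epoch_at_5000, epoch_at_25000)
```

On these numbers the final student is below the teacher on enwikippl and frwikippl and above it on tinystoriesppl and zhwikippl.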