---
base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.10
  results: []
---

# distily_bench_obj_cross_v2.10

This student model was distilled from the teacher model roneneldan/TinyStories-33M; the training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

  • eval_enwikippl: 108.1245
  • eval_frwikippl: 11043.4336
  • eval_zhwikippl: 55788.7734
  • eval_tinystoriesppl: 6.7037
  • eval_loss: 0.7047
  • eval_runtime: 13.0964
  • eval_samples_per_second: 76.357
  • eval_steps_per_second: 9.545
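The `*ppl` metrics above are perplexities: the exponential of the mean per-token negative log-likelihood on each corpus (English Wikipedia, French Wikipedia, Chinese Wikipedia, and TinyStories). A minimal sketch of that relationship follows; note that `eval_loss` here is the distillation loss rather than a language-modeling cross-entropy, so `exp(eval_loss)` does not reproduce the reported perplexities.

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(mean_nll)

# A lower mean NLL means the model assigns more probability to the
# reference text, hence a lower perplexity.
# Recovering a perplexity from its own log round-trips exactly:
print(perplexity(math.log(6.7037)))  # -> 6.7037 (up to float rounding)
```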

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 1e-05
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1.0
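Per the `distillation_objective` above, all of the training signal comes from a forward-KL loss between teacher and student logits (the hidden-state and attention components have weight 0). The following is a minimal pure-Python sketch of that per-token loss; Distily's actual implementation operates on batched tensors, so the function names here are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_logits_loss(teacher_logits, student_logits):
    """Forward KL(teacher || student) for a single token position.

    The student is penalized wherever it assigns less probability
    than the teacher, pushing it to cover the teacher's full
    output distribution.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

# Identical logits give zero loss; any mismatch gives a positive loss.
```

During training this quantity is averaged over all token positions in the batch and minimized with Adam under the linear learning-rate schedule listed above.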

## Resource Usage

Peak GPU Memory: 6.6064 GB

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
| 0 | 0 | 50480.5703 | 85684.4844 | 6.8305 | 13.0511 | 76.622 | 9.578 | 33932.0586 | 94692.1562 |
| 5000 | 0.0505 | 107.6398 | 10878.2363 | 0.7372 | 13.0368 | 76.706 | 9.588 | 6.6259 | 48821.6875 |
| 10000 | 0.1010 | 103.7832 | 10693.6533 | 0.7210 | 13.0383 | 76.697 | 9.587 | 6.3876 | 52904.0898 |
| 15000 | 0.1515 | 113.9463 | 10959.7607 | 0.7146 | 13.0455 | 76.655 | 9.582 | 7.3001 | 55833.4297 |
| 20000 | 0.2020 | 102.8906 | 10842.2969 | 0.7117 | 13.0448 | 76.659 | 9.582 | 6.3362 | 55967.6680 |
| 25000 | 0.2525 | 107.6648 | 11021.6855 | 0.7063 | 13.0457 | 76.654 | 9.582 | 6.7065 | 55654.9688 |
| 30000 | 0.3030 | 107.8986 | 11027.8887 | 0.7052 | 13.0423 | 76.673 | 9.584 | 6.6954 | 55122.9922 |
| 35000 | 0.3535 | 107.8986 | 10953.5859 | 0.7051 | 12.9974 | 76.939 | 9.617 | 6.6910 | 54771.1680 |
| 40000 | 0.4040 | 107.9989 | 10941.2451 | 0.7053 | 13.0736 | 76.49 | 9.561 | 6.7123 | 55122.9922 |
| 45000 | 0.4545 | 107.8317 | 10986.0273 | 0.7051 | 13.0495 | 76.632 | 9.579 | 6.7056 | 55064.1953 |
| 50000 | 0.5051 | 107.9905 | 11037.2217 | 0.7049 | 13.0288 | 76.753 | 9.594 | 6.7202 | 55922.9062 |
| 55000 | 0.5556 | 108.2753 | 10973.6602 | 0.7051 | 13.0751 | 76.481 | 9.56 | 6.7202 | 54917.4609 |
| 60000 | 0.6061 | 108.0324 | 11037.2217 | 0.7052 | 13.0104 | 76.861 | 9.608 | 6.7093 | 55358.7930 |
| 65000 | 0.6566 | 108.3089 | 11043.4336 | 0.7049 | 13.0425 | 76.673 | 9.584 | 6.7123 | 55122.9922 |
| 70000 | 0.7071 | 108.2418 | 11043.4336 | 0.7047 | 12.9968 | 76.942 | 9.618 | 6.7065 | 55122.9922 |
| 75000 | 0.7576 | 107.9069 | 11043.4336 | 0.7046 | 13.0103 | 76.862 | 9.608 | 6.7004 | 55506.7109 |
| 80000 | 0.8081 | 108.1915 | 11043.4336 | 0.7047 | 13.0166 | 76.825 | 9.603 | 6.6979 | 55788.7734 |
| 85000 | 0.8586 | 108.3089 | 11043.4336 | 0.7045 | 13.0625 | 76.555 | 9.569 | 6.7076 | 55759.0430 |
| 90000 | 0.9091 | 108.2083 | 11043.4336 | 0.7047 | 13.0397 | 76.689 | 9.586 | 6.7059 | 55788.7734 |
| 95000 | 0.9596 | 108.1999 | 11043.4336 | 0.7045 | 13.0487 | 76.636 | 9.579 | 6.7062 | 55788.7734 |
| 99000 | 1.0 | 108.1245 | 11043.4336 | 0.7047 | 13.0964 | 76.357 | 9.545 | 6.7037 | 55788.7734 |

## Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • Pytorch 2.3.0
  • Datasets 2.21.0