metadata

base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.10
    results: []

distily_bench_obj_cross_v2.10

This student model was distilled from the teacher model roneneldan/TinyStories-33M. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

eval_enwikippl: 12766.3359
eval_frwikippl: 57742.3438
eval_zhwikippl: 65334.25
eval_tinystoriesppl: 4770.0942
eval_loss: 5.2085
eval_runtime: 13.0328
eval_samples_per_second: 76.73
eval_steps_per_second: 9.591
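
As a rough consistency check on the throughput figures above, the evaluation set size can be estimated from the reported runtime and rates. This is only a sketch: the values are copied from this card, eval_batch_size comes from the hyperparameter list, and the results are approximate because the reported numbers are rounded.

```python
# Values copied from the evaluation results and hyperparameters in this card.
eval_runtime = 13.0328            # seconds
eval_samples_per_second = 76.73
eval_steps_per_second = 9.591
eval_batch_size = 8

# runtime * rate gives an estimate of how many samples / batches were evaluated.
approx_samples = eval_runtime * eval_samples_per_second
approx_steps = eval_runtime * eval_steps_per_second

print(f"approx. eval samples: {approx_samples:.1f}")
print(f"approx. eval steps:   {approx_steps:.1f}")
print(f"steps * batch size:   {round(approx_steps) * eval_batch_size}")
```

Both estimates point to roughly 1000 evaluation samples processed in about 125 batches of 8.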

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 1e-06
train_batch_size: 1
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 1.0
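
To make two of the settings above concrete, here is a minimal pure-Python sketch of a KL-divergence loss on logits (the only objective component with non-zero weight) and of a linear learning-rate decay. It is not Distily's implementation. The function names are hypothetical, and the KL direction (teacher relative to student), the absence of a temperature, the absence of warmup, and the total of 99000 steps (taken from the last row of the eval table) are all assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position.

    Assumption: the teacher distribution is the reference and no
    temperature scaling is applied.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(t, s))


def linear_lr(step, base_lr=1e-06, total_steps=99000):
    """Linear decay from base_lr to 0 (assumes no warmup phase)."""
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / total_steps)


if __name__ == "__main__":
    # Identical logits give zero loss; diverging logits give a positive loss.
    print(kl_logits_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(kl_logits_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
    # Learning rate at the start, midpoint, and end of training.
    print(linear_lr(0), linear_lr(49500), linear_lr(99000))
```

In a real training loop the loss would be averaged over all token positions in a batch, typically with a tensor library rather than Python lists.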

Resource Usage

Peak GPU Memory: 6.6048 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		169.9865	47377.9414					3.9789	4998.1294
0	0	61801.5039	81001.6719	6.4680	13.0128	76.847	9.606	44522.7852	75358.2109
5000	0.0505	12766.3359	57742.3438	5.2085	12.9999	76.923	9.615	4771.6733	65264.5664
10000	0.1010	12766.3359	57742.3438	5.2085	13.0144	76.838	9.605	4768.5161	65334.25
15000	0.1515	12766.3359	57742.3438	5.2085	13.0239	76.782	9.598	4770.0942	65334.25
20000	0.2020	12766.3359	57742.3438	5.2085	12.9909	76.977	9.622	4769.3076	65334.25
25000	0.2525	12766.3359	57709.8086	5.2083	13.1403	76.102	9.513	4768.5161	65334.25
30000	0.3030	12766.3359	57709.8086	5.2083	13.0382	76.698	9.587	4768.5161	65334.25
35000	0.3535	12766.3359	57742.3438	5.2083	13.0826	76.438	9.555	4770.0942	65334.25
40000	0.4040	12766.3359	57742.3438	5.2085	13.0472	76.645	9.581	4769.3076	65334.25
45000	0.4545	12766.3359	57742.3438	5.2085	13.1664	75.951	9.494	4770.0942	65334.25
50000	0.5051	12766.3359	57742.3438	5.2083	13.047	76.646	9.581	4768.5161	65334.25
55000	0.5556	12766.3359	57742.3438	5.2083	13.2134	75.681	9.46	4768.5161	65334.25
60000	0.6061	12766.3359	57742.3438	5.2087	13.0275	76.761	9.595	4769.3076	65334.25
65000	0.6566	12766.3359	57742.3438	5.2083	13.1101	76.277	9.535	4768.5161	65334.25
70000	0.7071	12766.3359	57742.3438	5.2085	13.0485	76.637	9.58	4771.6733	65334.25
75000	0.7576	12766.3359	57742.3438	5.2085	13.0209	76.8	9.6	4768.5161	65299.4297
80000	0.8081	12766.3359	57742.3438	5.2085	13.0587	76.577	9.572	4771.6733	65334.25
85000	0.8586	12766.3359	57742.3438	5.2085	13.0404	76.685	9.586	4770.0942	65299.4297
90000	0.9091	12766.3359	57742.3438	5.2087	13.0082	76.874	9.609	4770.0942	65334.25
95000	0.9596	12766.3359	57742.3438	5.2085	13.0077	76.878	9.61	4769.3076	65334.25
99000	1.0	12766.3359	57742.3438	5.2085	13.0328	76.73	9.591	4770.0942	65334.25
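
The table shows enwikippl unchanged at 12766.3359 from step 5000 through step 99000, with the other metrics moving only slightly. One way to read the remaining gap is to express the final student perplexities as multiples of the teacher's. The sketch below does this using only values copied from the "teacher eval" and step 99000 rows; the variable names are illustrative.

```python
# Perplexities copied from the "teacher eval" row and the final (step 99000) row.
teacher = {
    "enwikippl": 169.9865,
    "frwikippl": 47377.9414,
    "tinystoriesppl": 3.9789,
    "zhwikippl": 4998.1294,
}
student_final = {
    "enwikippl": 12766.3359,
    "frwikippl": 57742.3438,
    "tinystoriesppl": 4770.0942,
    "zhwikippl": 65334.25,
}

# Ratio > 1 means the student is worse (higher perplexity) than the teacher.
ratios = {name: student_final[name] / teacher[name] for name in teacher}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:>16}: {ratio:10.2f}x teacher perplexity")
```

The gap is smallest on frwikippl, where the teacher itself has a very high perplexity, and largest on tinystoriesppl.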

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0