---
base_model: roneneldan/TinyStories-33M
library_name: Distily
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.10
  results: []
---
# distily_bench_obj_cross_v2.10
This student model was distilled from the teacher model roneneldan/TinyStories-33M on an unspecified dataset, using the Distily library.
It achieves the following results on the evaluation set:
- eval_enwikippl: 132.5935
- eval_frwikippl: 19405.3008
- eval_zhwikippl: 53229.7070
- eval_tinystoriesppl: 9.1860
- eval_loss: 1.2126
- eval_runtime: 13.0629
- eval_samples_per_second: 76.553
- eval_steps_per_second: 9.569
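
The perplexity figures above come from Distily's evaluation loop. As a rough illustration of how a causal-LM perplexity of this kind can be computed with the `transformers` API (a minimal sketch, not the exact evaluation code; the model path below is a placeholder):

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: replace with this model's Hub repo id or a local checkpoint path.
model_id = "path/to/distily_bench_obj_cross_v2.10"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

text = "Once upon a time, a little fox found a shiny stone."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing the input ids as labels makes the model return the mean
    # next-token cross-entropy loss over the sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = math.exp(outputs.loss.item())
print(f"perplexity: {perplexity:.2f}")
```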
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 4e-06
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.0
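
The `distillation_objective` above assigns weight 1 to a KL-divergence loss on the logits and weight 0 to the hidden-state and attention components, so only the logits term contributes to training. The following PyTorch snippet is an illustrative sketch of such a logits-level KL loss, not Distily's actual implementation; the tensor shapes and vocabulary size are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence from the teacher's to the student's next-token distribution.

    Both inputs have shape (batch, seq_len, vocab_size). This mirrors the idea of
    the `logits_loss_component` above; it is not Distily's exact code.
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    # Flatten batch and sequence dims so "batchmean" averages over all token positions.
    return F.kl_div(
        student_log_probs.flatten(0, 1),
        teacher_probs.flatten(0, 1),
        reduction="batchmean",
    )

# Example with random logits, just to show the expected shapes (vocab size assumed).
student = torch.randn(2, 16, 50257)
teacher = torch.randn(2, 16, 50257)
loss = logits_kl_loss(student, teacher)
print(loss.item())
```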
## Resource Usage
Peak GPU Memory: 6.6064 GB
## Eval-Phase Metrics
step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
---|---|---|---|---|---|---|---|---|---|
teacher eval | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
0 | 0 | 50480.5703 | 85684.4844 | 6.8305 | 13.0395 | 76.69 | 9.586 | 33932.0586 | 94692.1562 |
5000 | 0.0505 | 132.2037 | 19438.1211 | 1.2135 | 13.0309 | 76.741 | 9.593 | 9.1868 | 53144.5430 |
10000 | 0.1010 | 132.4087 | 19372.5156 | 1.2127 | 13.048 | 76.64 | 9.58 | 9.1830 | 53201.2852 |
15000 | 0.1515 | 132.5935 | 19416.2402 | 1.2128 | 13.0596 | 76.572 | 9.571 | 9.1887 | 53144.5430 |
20000 | 0.2020 | 132.4292 | 19367.0664 | 1.2127 | 13.037 | 76.705 | 9.588 | 9.1955 | 53343.4375 |
25000 | 0.2525 | 132.5113 | 19367.0664 | 1.2126 | 13.0647 | 76.542 | 9.568 | 9.1890 | 53258.0898 |
30000 | 0.3030 | 132.5935 | 19383.4375 | 1.2125 | 13.0105 | 76.861 | 9.608 | 9.1845 | 53201.2852 |
35000 | 0.3535 | 132.3267 | 19372.5156 | 1.2127 | 13.1134 | 76.258 | 9.532 | 9.1754 | 53229.7070 |
40000 | 0.4040 | 132.5935 | 19367.0664 | 1.2127 | 13.0356 | 76.713 | 9.589 | 9.1928 | 53229.7070 |
45000 | 0.4545 | 132.4908 | 19372.5156 | 1.2126 | 13.0611 | 76.563 | 9.57 | 9.1826 | 53258.0898 |
50000 | 0.5051 | 132.2447 | 19405.3008 | 1.2126 | 13.07 | 76.511 | 9.564 | 9.1803 | 53286.5391 |
55000 | 0.5556 | 132.6346 | 19405.3008 | 1.2126 | 13.0134 | 76.844 | 9.605 | 9.1917 | 53229.7070 |
60000 | 0.6061 | 132.6346 | 19405.3008 | 1.2126 | 13.0453 | 76.656 | 9.582 | 9.1883 | 53258.0898 |
65000 | 0.6566 | 132.6346 | 19394.3652 | 1.2126 | 13.0475 | 76.643 | 9.58 | 9.1928 | 53258.0898 |
70000 | 0.7071 | 132.5935 | 19427.1680 | 1.2125 | 13.0602 | 76.568 | 9.571 | 9.1830 | 53229.7070 |
75000 | 0.7576 | 132.4292 | 19405.3008 | 1.2126 | 13.0658 | 76.535 | 9.567 | 9.1788 | 53229.7070 |
80000 | 0.8081 | 132.6346 | 19405.3008 | 1.2127 | 13.0497 | 76.63 | 9.579 | 9.1871 | 53229.7070 |
85000 | 0.8586 | 132.5935 | 19405.3008 | 1.2126 | 13.0439 | 76.664 | 9.583 | 9.1879 | 53229.7070 |
90000 | 0.9091 | 132.5935 | 19405.3008 | 1.2126 | 13.0368 | 76.706 | 9.588 | 9.1814 | 53229.7070 |
95000 | 0.9596 | 132.5935 | 19405.3008 | 1.2126 | 13.0326 | 76.731 | 9.591 | 9.1868 | 53229.7070 |
99000 | 1.0 | 132.5935 | 19405.3008 | 1.2126 | 13.0629 | 76.553 | 9.569 | 9.1860 | 53229.7070 |
### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0