metadata

base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.13_gpt2
    results: []

distily_bench_obj_cross_v2.13_gpt2

This student model was distilled from the teacher model gpt2. The training dataset is unspecified.

The Distily library was used for this distillation.
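As a rough illustration of what distilling with a KL loss on logits means (the distillation_objective under the training hyperparameters lists loss_fn=kl for the logits component), the sketch below computes a KL divergence between a teacher and a student distribution over a tiny vocabulary. It is not Distily's implementation: the direction KL(teacher || student), the absence of a temperature, the helper names, and the example logits are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one distribution over the vocabulary.

    Assumption: teacher-to-student direction, no temperature scaling.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(t, s))


# Hypothetical 4-token vocabulary; these logits are made up for illustration.
teacher = [2.0, 1.0, 0.1, -1.0]
student_far = [0.0, 0.0, 0.0, 0.0]      # uniform student, far from the teacher
student_near = [1.9, 1.1, 0.0, -0.9]    # student that roughly tracks the teacher

kl_far = kl_divergence(teacher, student_far)
kl_near = kl_divergence(teacher, student_near)
kl_self = kl_divergence(teacher, teacher)
print(kl_far, kl_near, kl_self)
```

The loss shrinks as the student's distribution approaches the teacher's and is zero when the two match, which is the behaviour a distillation objective of this kind relies on.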

It achieves the following results on the evaluation set. These values match the step 750 row (epoch 0.1010) of the eval-phase table, not the final step:

eval_enwikippl: 2176.0
eval_frwikippl: 8832.0
eval_zhwikippl: 127488.0
eval_tinystoriesppl: 1776.0
eval_loss: 3.2370
eval_runtime: 12.9467
eval_samples_per_second: 46.344
eval_steps_per_second: 11.586
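The throughput figures can be cross-checked with simple arithmetic. The sketch below recovers the approximate number of evaluation samples and steps from the reported runtime and rates. It assumes the rates are plain counts divided by eval_runtime; the variable names are illustrative.

```python
# Values copied from the evaluation results above.
eval_runtime = 12.9467          # seconds
samples_per_second = 46.344
steps_per_second = 11.586

# Assumption: each rate is simply count / runtime, so count = rate * runtime.
eval_samples = round(eval_runtime * samples_per_second)
eval_steps = round(eval_runtime * steps_per_second)
samples_per_step = eval_samples / eval_steps

print(eval_samples, eval_steps, samples_per_step)
```

Under that assumption the evaluation set is about 600 samples processed in 150 steps, that is 4 samples per step, which agrees with the eval_batch_size of 4 listed in the training hyperparameters.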

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=kl, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
train_embeddings: True
learning_rate: 0.0001
train_batch_size: 8
eval_batch_size: 4
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.5
num_epochs: 1.0
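
The scheduler settings above (linear, warmup ratio 0.5, peak learning rate 0.0001) imply a triangular schedule: the learning rate climbs for the first half of training and decays to zero over the second half. The sketch below is an approximation that assumes the usual Transformers behaviour (warmup steps rounded up from the ratio, then linear decay) and takes the total of 7425 steps from the final row of the eval-phase table. The function name is hypothetical, and Distily may configure the scheduler differently.

```python
import math


def linear_schedule_lr(step, total_steps, warmup_ratio, peak_lr):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Linear decay from peak_lr down to 0 at total_steps.
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


TOTAL_STEPS = 7425   # final step in the eval-phase table
PEAK_LR = 1e-4       # learning_rate
WARMUP_RATIO = 0.5   # lr_scheduler_warmup_ratio

for step in (0, 750, 3000, 3713, 6000, 7425):
    lr = linear_schedule_lr(step, TOTAL_STEPS, WARMUP_RATIO, PEAK_LR)
    print(f"step {step:>5}: lr = {lr:.3e}")
```

With these settings the peak is reached around step 3713, so roughly the first five evaluation checkpoints fall inside the warmup phase.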

Resource Usage

Peak GPU Memory: 8.0905 GB

Eval-Phase Metrics

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval	n/a	43.75	61.75	n/a	n/a	n/a	n/a	11.8125	19.125
0	0	1821066133504.0	158329674399744.0	25.4650	12.9198	46.44	11.61	12079595520.0	98956046499840.0
750	0.1010	2176.0	8832.0	3.2370	12.9467	46.344	11.586	1776.0	127488.0
1500	0.2020	780.0	4704.0	2.2858	12.9531	46.321	11.58	580.0	6528.0
2250	0.3030	448.0	2720.0	1.9337	12.9786	46.23	11.558	358.0	616.0
3000	0.4040	318.0	1424.0	1.6665	12.9898	46.19	11.548	252.0	264.0
3750	0.5051	252.0	968.0	1.4830	12.9776	46.233	11.558	206.0	494.0
4500	0.6061	187.0	680.0	1.2771	12.9626	46.287	11.572	146.0	404.0
5250	0.7071	146.0	556.0	1.1009	12.9778	46.233	11.558	113.0	224.0
6000	0.8081	134.0	490.0	1.0233	12.9863	46.202	11.551	104.0	179.0
6750	0.9091	125.0	464.0	0.9838	12.985	46.207	11.552	96.0	168.0
7425	1.0	124.0	462.0	0.9755	13.0256	46.063	11.516	95.0	162.0
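
To put the final row in context, the sketch below compares the student's final perplexities (step 7425) with the teacher eval row by taking their ratio. The values are copied from the table; the dictionary names are illustrative, and the ratio is just one convenient way to summarise the remaining gap.

```python
# Perplexities copied from the "teacher eval" row and the final row (step 7425).
teacher = {
    "enwikippl": 43.75,
    "frwikippl": 61.75,
    "tinystoriesppl": 11.8125,
    "zhwikippl": 19.125,
}
student_final = {
    "enwikippl": 124.0,
    "frwikippl": 462.0,
    "tinystoriesppl": 95.0,
    "zhwikippl": 162.0,
}

# Ratio > 1 means the student's perplexity is still higher than the teacher's.
ratios = {name: student_final[name] / teacher[name] for name in teacher}

for name, ratio in ratios.items():
    print(f"{name}: student/teacher = {ratio:.2f}x")
```

Every ratio is above 1, so after one epoch the student still trails the teacher on all four perplexity metrics, with the smallest gap on enwikippl.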

Framework versions

Distily 0.2.0
Transformers 4.44.0
PyTorch 2.3.0
Datasets 2.21.0