---
base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.15_gpt2
    results: []
---

distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model gpt2 on an unspecified dataset.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

  • eval_enwikippl: 2352.0
  • eval_frwikippl: 10240.0
  • eval_zhwikippl: 109056.0
  • eval_tinystoriesppl: 1920.0
  • eval_loss: 2.6449
  • eval_runtime: 17.0132
  • eval_samples_per_second: 58.778
  • eval_steps_per_second: 7.347
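
The `*_ppl` metrics above are per-corpus perplexities. As a reminder, perplexity is the exponential of the mean next-token cross-entropy (in nats) over a corpus; a minimal sketch of that relationship (the helper name is illustrative, not part of Distily):

```python
import math

def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is exp(mean next-token cross-entropy), with the loss in nats."""
    return math.exp(mean_cross_entropy_nats)

# A zero cross-entropy corpus (perfect prediction) has perplexity 1.0;
# higher cross-entropy maps exponentially to higher perplexity.
```

Note that the per-corpus perplexities are computed on their respective corpora and are not simply `exp(eval_loss)`, since `eval_loss` here is the distillation objective, not a language-modeling cross-entropy.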

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=mse, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 0.0004
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • lr_scheduler_warmup_ratio: 0.2
  • num_epochs: 1.0
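
The `distillation_objective` above combines a KL-divergence loss on the logits (weight 1) with an MSE loss on the last hidden state (weight 1.0); the attention component has weight 0 and is disabled. A minimal PyTorch sketch of such a combined objective (this is an illustration of the configuration, not the Distily implementation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hs, teacher_hs,
                      logits_weight=1.0, hs_weight=1.0):
    """KL on logits plus MSE on the last hidden state, per the objective above."""
    # KL(teacher || student) over the vocabulary distribution at each position.
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # MSE between the final hidden states (layer_mapper=last: only the last
    # layer's states are compared, with no projector between them).
    mse = F.mse_loss(student_hs, teacher_hs)
    return logits_weight * kl + hs_weight * mse
```

With `layer_mapper=last` the student and teacher hidden states have matching shapes, so no projector is needed.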

Resource Usage

Peak GPU Memory: 8.0892 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 1984274890752.0 | 213305255788544.0 | 21.1260 | 17.0018 | 58.817 | 7.352 | 3774873600.0 | 74217034874880.0 |
| 1000 | 0.0808 | 516.0 | 3952.0 | 1.7644 | 17.0372 | 58.695 | 7.337 | 412.0 | 3760.0 |
| 2000 | 0.1616 | 516.0 | 3872.0 | 1.7584 | 17.0128 | 58.779 | 7.347 | 462.0 | 860.0 |
| 3000 | 0.2424 | 864.0 | 4672.0 | 2.0719 | 17.0071 | 58.799 | 7.35 | 788.0 | 2448.0 |
| 4000 | 0.3232 | 1888.0 | 9344.0 | 2.5277 | 17.1241 | 58.397 | 7.3 | 1696.0 | 26880.0 |
| 5000 | 0.4040 | 2008.0 | 7712.0 | 2.5758 | 17.0318 | 58.714 | 7.339 | 2256.0 | 48128.0 |
| 6000 | 0.4848 | 2352.0 | 9984.0 | 2.6397 | 17.1643 | 58.26 | 7.283 | 1856.0 | 54528.0 |
| 7000 | 0.5657 | 2416.0 | 12096.0 | 2.6472 | 17.0957 | 58.494 | 7.312 | 1880.0 | 109568.0 |
| 8000 | 0.6465 | 2448.0 | 9856.0 | 2.6570 | 17.0094 | 58.791 | 7.349 | 1960.0 | 115712.0 |
| 9000 | 0.7273 | 2352.0 | 10240.0 | 2.6449 | 17.0132 | 58.778 | 7.347 | 1920.0 | 109056.0 |
| 10000 | 0.8081 | 2320.0 | 9344.0 | 2.6556 | 17.0386 | 58.69 | 7.336 | 2096.0 | 87040.0 |
| 11000 | 0.8889 | 2304.0 | 12224.0 | 2.6333 | 17.0346 | 58.704 | 7.338 | 1888.0 | 130048.0 |
| 12000 | 0.9697 | 2208.0 | 10368.0 | 2.6107 | 17.0435 | 58.674 | 7.334 | 1808.0 | 98816.0 |
| 12375 | 1.0 | 2256.0 | 10304.0 | 2.6066 | 17.0663 | 58.595 | 7.324 | 1696.0 | 80896.0 |

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • PyTorch 2.3.0
  • Datasets 2.21.0