---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.15_gpt2
  results: []
---

# distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model [gpt2](https://huggingface.co/gpt2); the training dataset is unspecified.

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
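
Since the student keeps the GPT-2 architecture, it loads like any causal language model in `transformers`. A minimal usage sketch (the repo id below is an assumption; substitute the path this checkpoint is actually published under):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_bench_obj_cross_v2.15_gpt2"  # assumed repo id; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```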

It achieves the following results on the evaluation set:
- eval_enwikippl: 2352.0
- eval_frwikippl: 10240.0
- eval_zhwikippl: 109056.0
- eval_tinystoriesppl: 1920.0
- eval_loss: 2.6449
- eval_runtime: 17.0132
- eval_samples_per_second: 58.778
- eval_steps_per_second: 7.347
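
The `*ppl` metrics are perplexities on held-out slices of the corresponding corpora (presumably English/French/Chinese Wikipedia text and TinyStories). The sketch below shows one way such a perplexity can be recomputed with `transformers`; the exact evaluation windows, batching, and rounding used by Distily are not documented here, so results will not match the table exactly:

```python
import math
import torch

def perplexity(model, tokenizer, texts, max_length=1024, device="cpu"):
    """Token-weighted perplexity over a list of texts (illustrative, not Distily's exact procedure)."""
    model.to(device).eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length).to(device)
            out = model(**enc, labels=enc["input_ids"])  # loss = mean cross-entropy over shifted targets
            n_targets = enc["input_ids"].numel() - 1
            total_nll += out.loss.item() * n_targets
            total_tokens += n_targets
    return math.exp(total_nll / total_tokens)
```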

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed
-->

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (an illustrative sketch of the distillation objective follows the list):
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=1.0, loss_fn=mse, layer_mapper=last, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 0.0004
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 1.0
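
The `distillation_objective` listed above combines a KL divergence on the student and teacher logits (weight 1) with an MSE between their last hidden states (weight 1.0), while the attention component is disabled (weight 0). The sketch below illustrates that loss combination in plain PyTorch; it is not Distily's actual implementation, and `student_out` / `teacher_out` are placeholders for model outputs produced with `output_hidden_states=True`:

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, logits_weight=1.0, hs_weight=1.0):
    """Illustrative KL-on-logits + MSE-on-last-hidden-state objective (not Distily's exact code)."""
    # KL divergence between teacher and student next-token distributions
    student_logp = F.log_softmax(student_out.logits, dim=-1)
    teacher_p = F.softmax(teacher_out.logits, dim=-1)
    logits_loss = F.kl_div(student_logp, teacher_p, reduction="batchmean")

    # MSE between the final hidden states (layer_mapper=last)
    hs_loss = F.mse_loss(student_out.hidden_states[-1], teacher_out.hidden_states[-1])

    return logits_weight * logits_loss + hs_weight * hs_loss
```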

### Resource Usage
Peak GPU Memory: 8.0892 GB
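
A figure like the one above can, in principle, be read back from PyTorch's CUDA allocator statistics; how Distily itself records it is not documented here:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the training loop ...
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.4f} GB")
```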

### Eval-Phase Metrics
| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** |  | 43.75 | 61.75 |  |  |  |  | 11.8125 | 19.125 |
| 0 | 0 | 1984274890752.0 | 213305255788544.0 | 21.1260 | 17.0018 | 58.817 | 7.352 | 3774873600.0 | 74217034874880.0 |
| 1000 | 0.0808 | 516.0 | 3952.0 | 1.7644 | 17.0372 | 58.695 | 7.337 | 412.0 | 3760.0 |
| 2000 | 0.1616 | 516.0 | 3872.0 | 1.7584 | 17.0128 | 58.779 | 7.347 | 462.0 | 860.0 |
| 3000 | 0.2424 | 864.0 | 4672.0 | 2.0719 | 17.0071 | 58.799 | 7.35 | 788.0 | 2448.0 |
| 4000 | 0.3232 | 1888.0 | 9344.0 | 2.5277 | 17.1241 | 58.397 | 7.3 | 1696.0 | 26880.0 |
| 5000 | 0.4040 | 2008.0 | 7712.0 | 2.5758 | 17.0318 | 58.714 | 7.339 | 2256.0 | 48128.0 |
| 6000 | 0.4848 | 2352.0 | 9984.0 | 2.6397 | 17.1643 | 58.26 | 7.283 | 1856.0 | 54528.0 |
| 7000 | 0.5657 | 2416.0 | 12096.0 | 2.6472 | 17.0957 | 58.494 | 7.312 | 1880.0 | 109568.0 |
| 8000 | 0.6465 | 2448.0 | 9856.0 | 2.6570 | 17.0094 | 58.791 | 7.349 | 1960.0 | 115712.0 |
| 9000 | 0.7273 | 2352.0 | 10240.0 | 2.6449 | 17.0132 | 58.778 | 7.347 | 1920.0 | 109056.0 |
| 10000 | 0.8081 | 2320.0 | 9344.0 | 2.6556 | 17.0386 | 58.69 | 7.336 | 2096.0 | 87040.0 |
| 11000 | 0.8889 | 2304.0 | 12224.0 | 2.6333 | 17.0346 | 58.704 | 7.338 | 1888.0 | 130048.0 |
| 12000 | 0.9697 | 2208.0 | 10368.0 | 2.6107 | 17.0435 | 58.674 | 7.334 | 1808.0 | 98816.0 |
| 12375 | 1.0 | 2256.0 | 10304.0 | 2.6066 | 17.0663 | 58.595 | 7.324 | 1696.0 | 80896.0 |

### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0