---
base_model: gpt2
library_name: Distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_obj_cross_v2.15_gpt2
  results: []
---

# distily_bench_obj_cross_v2.15_gpt2

This student model was distilled from the teacher model [gpt2](https://huggingface.co/gpt2). The training dataset is unspecified.

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

It achieves the following results on the evaluation set (these values match the step 9000 row of the Eval-Phase Metrics table, not the final step):
- eval_enwikippl: 2192.0
- eval_frwikippl: 11200.0
- eval_zhwikippl: 93184.0
- eval_tinystoriesppl: 1808.0
- eval_loss: 2.6293
- eval_runtime: 16.9228
- eval_samples_per_second: 59.092
- eval_steps_per_second: 7.386
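
The card does not define the `*ppl` metrics. The sketch below assumes they are perplexities, computed as the exponential of the mean per-token negative log-likelihood; the function name and the sample values are hypothetical and are not taken from Distily.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a sequence of tokens.

    `token_log_probs` holds natural-log probabilities the model assigned
    to each observed token (hypothetical input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among 4 options, so its perplexity is about 4.
uniform_guess = [math.log(0.25)] * 4
print(perplexity(uniform_guess))
```

Under this definition, `exp(eval_loss)` is roughly 13.9, which matches none of the reported perplexities. That suggests `eval_loss` is the distillation loss rather than a language-modeling cross-entropy, although the card does not say so.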

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed
-->

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
- train_embeddings: True
- learning_rate: 0.0004
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 1.0
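
The `distillation_objective` above gives weight 1 to a KL loss on the logits and weight 0 to the hidden-state and attention components. As a rough, framework-free illustration of that setup (not Distily's implementation), the sketch below computes KL(teacher || student) for a single token position. The direction of the divergence, the function names, and the example logits are all assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary at one token position.

    Assumption: the teacher distribution is the target, as is common in
    knowledge distillation.
    """
    teacher_lp = log_softmax(teacher_logits)
    student_lp = log_softmax(student_logits)
    return sum(
        math.exp(t) * (t - s) for t, s in zip(teacher_lp, student_lp)
    )


# Hypothetical logits over a 3-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.2]

# Weights mirror the card: logits=1, hidden states=0, attention=0,
# so only the logits term contributes to the total.
total_loss = 1 * kl_logits_loss(teacher, student) + 0 * 0.0 + 0 * 0.0
print(total_loss)
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise. In practice it would be averaged over all token positions in a batch.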

### Resource Usage
Peak GPU Memory: 7.9368 GB

### Eval-Phase Metrics
| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 43.75 | 61.75 | | | | | 11.8125 | 19.125 |
| 0 | 0 | 2473901162496.0 | 170424302305280.0 | 20.7680 | 16.794 | 59.545 | 7.443 | 4060086272.0 | 71468255805440.0 |
| 1000 | 0.0808 | 688.0 | 3728.0 | 1.9530 | 16.821 | 59.449 | 7.431 | 652.0 | 2784.0 |
| 2000 | 0.1616 | 1728.0 | 8256.0 | 2.4948 | 16.7878 | 59.567 | 7.446 | 1384.0 | 35584.0 |
| 3000 | 0.2424 | 2040.0 | 10112.0 | 2.6087 | 16.7522 | 59.694 | 7.462 | 1720.0 | 64256.0 |
| 4000 | 0.3232 | 2160.0 | 9280.0 | 2.6353 | 16.796 | 59.538 | 7.442 | 1816.0 | 57088.0 |
| 5000 | 0.4040 | 1904.0 | 9088.0 | 2.5782 | 16.8206 | 59.451 | 7.431 | 1848.0 | 61440.0 |
| 6000 | 0.4848 | 1840.0 | 8960.0 | 2.5344 | 16.7618 | 59.659 | 7.457 | 1592.0 | 69120.0 |
| 7000 | 0.5657 | 1808.0 | 8512.0 | 2.5269 | 16.7913 | 59.555 | 7.444 | 1648.0 | 60672.0 |
| 8000 | 0.6465 | 2096.0 | 8960.0 | 2.6404 | 16.8233 | 59.442 | 7.43 | 1928.0 | 137216.0 |
| 9000 | 0.7273 | 2192.0 | 11200.0 | 2.6293 | 16.9228 | 59.092 | 7.386 | 1808.0 | 93184.0 |
| 10000 | 0.8081 | 1944.0 | 9984.0 | 2.5759 | 16.857 | 59.323 | 7.415 | 1568.0 | 80896.0 |
| 11000 | 0.8889 | 1736.0 | 9344.0 | 2.5147 | 16.8438 | 59.369 | 7.421 | 1488.0 | 48640.0 |
| 12000 | 0.9697 | 2224.0 | 11840.0 | 2.6633 | 16.7839 | 59.581 | 7.448 | 1968.0 | 98816.0 |
| 12375 | 1.0 | 2432.0 | 11072.0 | 2.7197 | 16.7952 | 59.541 | 7.443 | 2176.0 | 109568.0 |
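
The bookkeeping columns in this table can be cross-checked with a little arithmetic. The sketch below uses only numbers from the card. It assumes a single device with no gradient accumulation, which the card does not confirm, and the variable names are illustrative.

```python
TOTAL_STEPS = 12375       # final step in the table (epoch 1.0)
TRAIN_BATCH_SIZE = 8      # train_batch_size from the hyperparameters
EVAL_BATCH_SIZE = 8       # eval_batch_size from the hyperparameters


def epoch_fraction(step, total_steps=TOTAL_STEPS):
    """Fraction of the single training epoch completed at `step`."""
    return step / total_steps


# Training samples seen in one epoch (assumes one device and
# no gradient accumulation).
samples_seen = TOTAL_STEPS * TRAIN_BATCH_SIZE

# Evaluation set size implied by the step 9000 row:
# runtime (s) * samples_per_second, and runtime (s) * steps_per_second.
eval_samples = round(16.9228 * 59.092)
eval_steps = round(16.9228 * 7.386)

# Gap between the final student and the teacher on English Wikipedia.
enwiki_ratio = 2432.0 / 43.75

print(round(epoch_fraction(1000), 4), samples_seen, eval_samples, eval_steps)
print(round(enwiki_ratio, 1))
```

Under these assumptions the numbers are self-consistent: the implied evaluation sample count divided by `EVAL_BATCH_SIZE` equals the implied number of evaluation steps. The lowest student perplexities in the table occur at step 1000, not at the final step.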

### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.21.0