---
library_name: transformers
license: apache-2.0
base_model: alignment-handbook/zephyr-7b-sft-full
tags:
  - alignment-handbook
  - trl
  - dpo
  - generated_from_trainer
datasets:
  - data/zephyr_uf_rlced_conifer_ref
model-index:
  - name: zephyr-7b-uf-rlced-conifer-group-dpo-2e-alr-0.01-1e
    results: []
---

zephyr-7b-uf-rlced-conifer-group-dpo-2e-alr-0.01-1e

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full on the data/zephyr_uf_rlced_conifer_ref dataset. It achieves the following results on the evaluation set (see the note after the list for how the reward metrics are defined):

  • Loss: 0.2572
  • Rewards/chosen: -2.2030
  • Rewards/rejected: -5.8511
  • Rewards/accuracies: 0.8675
  • Rewards/margins: 3.6481
  • Logps/rejected: -988.8447
  • Logps/chosen: -612.7692
  • Logits/rejected: 2.2087
  • Logits/chosen: 0.2455
  • Excess Loss: 0.0532
  • Alpha 0 Uf: 0.6287
  • Alpha 1 Rlced Conifer: 0.3713
  • Rewards/chosen 1 Rlced Conifer: -2.2869
  • Rewards/rejected 1 Rlced Conifer: -6.6795
  • Rewards/accuracies 1 Rlced Conifer: 0.9030
  • Rewards/margins 1 Rlced Conifer: 4.3926
  • Logps/rejected 1 Rlced Conifer: -1115.4857
  • Logps/chosen 1 Rlced Conifer: -652.2682
  • Logits/rejected 1 Rlced Conifer: 2.0086
  • Logits/chosen 1 Rlced Conifer: -0.0625
  • Task Loss 1 Rlced Conifer: 0.1962
  • Task Excess Loss 1 Rlced Conifer: 0.0645
  • Rewards/chosen 0 Uf: -1.8688
  • Rewards/rejected 0 Uf: -2.8942
  • Rewards/accuracies 0 Uf: 0.7397
  • Rewards/margins 0 Uf: 1.0254
  • Logps/rejected 0 Uf: -531.0295
  • Logps/chosen 0 Uf: -476.1427
  • Logits/rejected 0 Uf: 3.1191
  • Logits/chosen 0 Uf: 1.2438
  • Task Loss 0 Uf: 0.5240
  • Task Excess Loss 0 Uf: 0.0664
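
The Rewards/* metrics above follow TRL's standard DPO conventions (an assumption; the card does not define them). Each response y to a prompt x is scored by the implicit reward

$$
r_\theta(x, y) = \beta \left( \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right),
$$

so Rewards/margins is the mean of r(x, y_chosen) − r(x, y_rejected) over evaluation pairs, and Rewards/accuracies is the fraction of pairs whose margin is positive. The Alpha entries are per-subset group weights over the two data subsets (suffixes 0 Uf and 1 Rlced Conifer) and sum to one: 0.6287 + 0.3713 = 1.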

Model description

More information needed

Intended uses & limitations

More information needed
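
This section was left unfilled by the authors. As a starting point, here is a minimal chat-style generation sketch; the repo id NicholasCorrado/zephyr-7b-uf-rlced-conifer-group-dpo-2e-alr-0.01-1e is an assumption, and the chat template is inherited from the Zephyr SFT base model.

```python
# Minimal usage sketch (not from the authors); the repo id below is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NicholasCorrado/zephyr-7b-uf-rlced-conifer-group-dpo-2e-alr-0.01-1e"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain DPO in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```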

Training and evaluation data

More information needed
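
Only the local dataset path data/zephyr_uf_rlced_conifer_ref is given. For orientation, here is a hypothetical record in the prompt/chosen/rejected schema that TRL's DPO training expects (field names follow the TRL convention; the content is invented):

```python
# Hypothetical preference pair; the actual contents of
# data/zephyr_uf_rlced_conifer_ref are not documented in this card.
example = {
    "prompt": "How do I reverse a list in Python?",
    "chosen": "Call my_list.reverse() to reverse it in place, or use my_list[::-1] to get a reversed copy.",
    "rejected": "Python lists cannot be reversed.",
}
```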

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto a TRL DPO run follows the list):

  • learning_rate: 5e-07
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 256
  • total_eval_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
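
As noted above, here is a hedged sketch of how these settings map onto TRL's DPOTrainer. It is not the authors' script: the bf16 flag and dataset loading are assumptions, and the group-DPO weighting visible in the Alpha metrics is not part of stock DPOTrainer.

```python
# A hedged sketch, not the authors' training script.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "alignment-handbook/zephyr-7b-sft-full"
model = AutoModelForCausalLM.from_pretrained(base)
ref_model = AutoModelForCausalLM.from_pretrained(base)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(base)

# 8 devices x per-device batch 8 x 4 accumulation steps = effective batch 256,
# matching total_train_batch_size above.
args = DPOConfig(
    output_dir="zephyr-7b-uf-rlced-conifer-group-dpo-2e-alr-0.01-1e",
    learning_rate=5e-7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,  # assumption: typical for Zephyr-style DPO runs
)

# Assumes the local path holds prompt/chosen/rejected pairs (see the record
# sketch under "Training and evaluation data").
dataset = load_dataset("data/zephyr_uf_rlced_conifer_ref")

trainer = DPOTrainer(
    model,
    ref_model,
    args=args,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,  # named processing_class in newer TRL releases
)
trainer.train()
```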

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Excess Loss | Alpha 0 Uf | Alpha 1 Rlced Conifer | Rewards/chosen 1 Rlced Conifer | Rewards/rejected 1 Rlced Conifer | Rewards/accuracies 1 Rlced Conifer | Rewards/margins 1 Rlced Conifer | Logps/rejected 1 Rlced Conifer | Logps/chosen 1 Rlced Conifer | Logits/rejected 1 Rlced Conifer | Logits/chosen 1 Rlced Conifer | Task Loss 1 Rlced Conifer | Task Excess Loss 1 Rlced Conifer | Rewards/chosen 0 Uf | Rewards/rejected 0 Uf | Rewards/accuracies 0 Uf | Rewards/margins 0 Uf | Logps/rejected 0 Uf | Logps/chosen 0 Uf | Logits/rejected 0 Uf | Logits/chosen 0 Uf | Task Loss 0 Uf | Task Excess Loss 0 Uf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1859 | 0.2498 | 180 | 0.2923 | -1.9944 | -4.7058 | 0.8524 | 2.7115 | -874.3204 | -591.9084 | 0.9002 | -0.1249 | 0.0816 | 0.4921 | 0.5079 | -2.0854 | -5.3481 | 0.8866 | 3.2627 | -982.3445 | -632.1208 | 0.7201 | -0.3404 | 0.2278 | 0.0919 | -1.6415 | -2.4124 | 0.7158 | 0.7709 | -482.8499 | -453.4123 | 1.6757 | 0.5476 | 0.5480 | 0.1058 |
| 0.1646 | 0.4997 | 360 | 0.2654 | -2.3703 | -5.8263 | 0.8637 | 3.4560 | -986.3652 | -629.4960 | 1.5662 | -0.2281 | 0.0630 | 0.5888 | 0.4112 | -2.4491 | -6.6305 | 0.8988 | 4.1814 | -1110.5859 | -668.4894 | 1.4570 | -0.4928 | 0.2047 | 0.0719 | -2.0521 | -2.9610 | 0.7379 | 0.9089 | -537.7054 | -494.4703 | 2.1435 | 0.6282 | 0.5444 | 0.0878 |
| 0.162 | 0.7495 | 540 | 0.2603 | -2.0719 | -5.7198 | 0.8637 | 3.6479 | -975.7140 | -599.6583 | 1.8052 | -0.3472 | 0.0563 | 0.6201 | 0.3799 | -2.1783 | -6.5775 | 0.9020 | 4.3992 | -1105.2861 | -641.4061 | 1.6728 | -0.6324 | 0.1991 | 0.0667 | -1.6637 | -2.6673 | 0.7294 | 1.0036 | -508.3393 | -455.6315 | 2.4641 | 0.5657 | 0.5322 | 0.0717 |
| 0.1476 | 0.9993 | 720 | 0.2572 | -2.2030 | -5.8511 | 0.8675 | 3.6481 | -988.8447 | -612.7692 | 2.2087 | 0.2455 | 0.0532 | 0.6287 | 0.3713 | -2.2869 | -6.6795 | 0.9030 | 4.3926 | -1115.4857 | -652.2682 | 2.0086 | -0.0625 | 0.1962 | 0.0645 | -1.8688 | -2.8942 | 0.7397 | 1.0254 | -531.0295 | -476.1427 | 3.1191 | 1.2438 | 0.5240 | 0.0664 |

Framework versions

  • Transformers 4.44.2
  • Pytorch 2.2.0a0+81ea7a4
  • Datasets 2.21.0
  • Tokenizers 0.19.1