qwen_cCPO_entropy

This model is a fine-tuned version of trl-lib/qwen1.5-0.5b-sft on the yakazimir/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.4698
Rewards/chosen: -1.7544
Rewards/rejected: -2.3724
Rewards/accuracies: 0.6840
Rewards/margins: 0.6180
Logps/rejected: -2.3724
Logps/chosen: -1.7544
Logits/rejected: 0.2213
Logits/chosen: 0.1180
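
The reported margin matches the gap between the chosen and rejected rewards. The check below is a minimal sketch that assumes the margin is defined as chosen minus rejected; the variable names are illustrative and the numbers are copied from the list above.

```python
# Final evaluation metrics copied from the list above.
rewards_chosen = -1.7544
rewards_rejected = -2.3724
reported_margin = 0.6180

# Assumption: margin = reward of chosen response - reward of rejected response.
margin = rewards_chosen - rewards_rejected

# Allow for rounding to four decimal places in the reported figures.
matches = abs(margin - reported_margin) < 1e-4
print(f"computed margin: {margin:.4f}, matches reported: {matches}")
```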

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-06
train_batch_size: 2
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 16
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3.0
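
Two of these settings can be illustrated with a short sketch. The first part checks that the per-device batch size times the gradient accumulation steps already equals the reported total batch size. The second part sketches a cosine schedule with linear warmup, assuming the common form (linear ramp to the peak rate, then half-cosine decay to zero). The helper `lr_at` and the value of `TOTAL_STEPS` are hypothetical: the card does not state the scheduled number of steps, so the last logged step (5600) is used as a stand-in.

```python
import math

# Values copied from the hyperparameter list above.
LEARNING_RATE = 1e-06
TRAIN_BATCH_SIZE = 2              # per-device batch size
GRADIENT_ACCUMULATION_STEPS = 16
TOTAL_TRAIN_BATCH_SIZE = 32
WARMUP_RATIO = 0.1

# Examples consumed per optimizer step on a single process.
per_process_batch = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
print(per_process_batch == TOTAL_TRAIN_BATCH_SIZE)  # True


def lr_at(step, total_steps, peak_lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Hypothetical helper: learning rate at a given optimizer step."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Half-cosine decay from the peak learning rate down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumption: total step count taken from the last logged step, not from the card.
TOTAL_STEPS = 5600

for step in (0, 560, 3080, 5600):
    print(step, lr_at(step, TOTAL_STEPS))
```

Under these assumptions the rate peaks at the end of warmup (step 560), falls to half the peak midway through the decay, and reaches zero at the final step.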

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.547	0.2141	400	0.5455	-1.3437	-1.4812	0.5579	0.1374	-1.4812	-1.3437	0.3541	0.2674
0.5301	0.4282	800	0.5165	-1.3889	-1.6415	0.5927	0.2526	-1.6415	-1.3889	0.4272	0.3345
0.5265	0.6422	1200	0.4985	-1.4579	-1.8204	0.6224	0.3625	-1.8204	-1.4579	0.3699	0.2760
0.4765	0.8563	1600	0.4935	-1.4994	-1.8829	0.6380	0.3836	-1.8829	-1.4994	0.3198	0.2257
0.5542	1.0704	2000	0.4872	-1.4687	-1.8582	0.6372	0.3895	-1.8582	-1.4687	0.3054	0.2090
0.4732	1.2845	2400	0.4775	-1.6420	-2.1625	0.6669	0.5204	-2.1625	-1.6420	0.3805	0.2752
0.5055	1.4986	2800	0.4755	-1.6156	-2.1129	0.6639	0.4973	-2.1129	-1.6156	0.4048	0.2981
0.4945	1.7127	3200	0.4738	-1.5940	-2.0956	0.6677	0.5016	-2.0956	-1.5940	0.3909	0.2834
0.4619	1.9267	3600	0.4700	-1.6914	-2.2530	0.6728	0.5617	-2.2530	-1.6914	0.3536	0.2473
0.4109	2.1408	4000	0.4699	-1.7062	-2.2883	0.6780	0.5822	-2.2883	-1.7062	0.3677	0.2556
0.4282	2.3549	4400	0.4707	-1.7749	-2.3952	0.6877	0.6202	-2.3952	-1.7749	0.2280	0.1239
0.4299	2.5690	4800	0.4704	-1.7425	-2.3507	0.6803	0.6082	-2.3507	-1.7425	0.3027	0.1929
0.4414	2.7831	5200	0.4698	-1.7506	-2.3686	0.6847	0.6181	-2.3686	-1.7506	0.2344	0.1302
0.404	2.9972	5600	0.4698	-1.7544	-2.3724	0.6840	0.6180	-2.3724	-1.7544	0.2213	0.1180
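
The table can be summarized with a few lines of code. The sketch below copies the step, epoch, validation loss, and reward accuracy columns from the rows above, then finds the checkpoints with the lowest validation loss and the highest accuracy. The steps-per-epoch figure is an estimate derived from the logged step and epoch values, not a number stated in the card.

```python
# (step, epoch, validation_loss, rewards_accuracy), copied from the table above.
rows = [
    (400, 0.2141, 0.5455, 0.5579),
    (800, 0.4282, 0.5165, 0.5927),
    (1200, 0.6422, 0.4985, 0.6224),
    (1600, 0.8563, 0.4935, 0.6380),
    (2000, 1.0704, 0.4872, 0.6372),
    (2400, 1.2845, 0.4775, 0.6669),
    (2800, 1.4986, 0.4755, 0.6639),
    (3200, 1.7127, 0.4738, 0.6677),
    (3600, 1.9267, 0.4700, 0.6728),
    (4000, 2.1408, 0.4699, 0.6780),
    (4400, 2.3549, 0.4707, 0.6877),
    (4800, 2.5690, 0.4704, 0.6803),
    (5200, 2.7831, 0.4698, 0.6847),
    (5600, 2.9972, 0.4698, 0.6840),
]

# Lowest validation loss, and every step that reaches it (ties are possible).
min_loss = min(loss for _, _, loss, _ in rows)
min_loss_steps = [step for step, _, loss, _ in rows if loss == min_loss]

# Step with the highest reward accuracy.
best_acc_step, _, _, best_acc = max(rows, key=lambda r: r[3])

# Rough estimate of optimizer steps per epoch from the last logged row.
last_step, last_epoch, _, _ = rows[-1]
steps_per_epoch = last_step / last_epoch

print("lowest validation loss:", min_loss, "at steps", min_loss_steps)
print("highest accuracy:", best_acc, "at step", best_acc_step)
print("estimated steps per epoch:", round(steps_per_epoch))
```

Validation loss plateaus near 0.47 from step 3600 onward, while accuracy peaks slightly earlier than the final checkpoint.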

Framework versions

Transformers 4.44.2
PyTorch 2.2.2+cu121
Datasets 2.18.0
Tokenizers 0.19.1
