zephyr-7b-dpo-qlora

This model is a fine-tuned version of TII-Frontier-Team/falcon3-3b-instruct on the TII-Frontier-Team/Reasoning_DPO dataset. It achieves the following results on the evaluation set:

Loss: 0.0299
Rewards/chosen: -4.6362
Rewards/rejected: -10.4479
Rewards/accuracies: 0.9306
Rewards/margins: 5.8117
Logps/rejected: -1080.7013
Logps/chosen: -496.4129
Logits/rejected: 2.0470
Logits/chosen: 2.2558
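
The reward figures above are related by simple arithmetic: the margin is the chosen reward minus the rejected reward. The sketch below checks that against the reported numbers. It also evaluates the loss at the mean margin, assuming the standard sigmoid DPO loss, -log(sigmoid(margin)). The card does not state the loss type, so that part is an assumption. Because this loss is convex in the margin, its value at the mean margin is only a lower bound on the mean loss and is not expected to equal the reported 0.0299.

```python
import math

# Final evaluation metrics as reported in this card.
rewards_chosen = -4.6362
rewards_rejected = -10.4479
reported_margin = 5.8117
reported_loss = 0.0299

# Rewards/margins is the difference between chosen and rejected rewards.
margin = rewards_chosen - rewards_rejected
print(f"margin = {margin:.4f}")  # margin = 5.8117


def dpo_sigmoid_loss(m: float) -> float:
    """-log(sigmoid(m)), written as log(1 + exp(-m)).

    Assumption: the run used the standard sigmoid DPO loss.
    """
    return math.log1p(math.exp(-m))


# Loss evaluated at the mean margin: a lower bound on the mean loss (Jensen).
loss_at_mean_margin = dpo_sigmoid_loss(margin)
print(f"loss at mean margin = {loss_at_mean_margin:.4f}")
print(f"reported mean loss  = {reported_loss:.4f}")
```

The gap between the two loss values indicates that a minority of evaluation pairs still have small or negative margins, which is consistent with the reported accuracy of 0.9306.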

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 4
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 4
total_train_batch_size: 128
total_eval_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
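
The total batch sizes listed above follow from the per-device values, the device count, and gradient accumulation. The sketch below reproduces that arithmetic and shows one way the learning rate could evolve under a cosine schedule with linear warmup. The `lr_at` helper and the `total_steps` value are illustrative assumptions: the card does not state the total number of optimizer steps, and the actual scheduler implementation may differ in detail.

```python
import math

# Values taken from the hyperparameter list in this card.
per_device_train_batch_size = 4
per_device_eval_batch_size = 8
num_devices = 8
gradient_accumulation_steps = 4
peak_lr = 5e-6
warmup_ratio = 0.1

# Effective batch sizes.
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
total_eval_batch_size = per_device_eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)  # 128 64


def lr_at(step: int, total_steps: int) -> float:
    """Hypothetical helper: linear warmup, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# total_steps=1000 is a placeholder for illustration, not the real run length.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```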

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6913	0.0315	100	0.6911	0.0007	-0.0036	0.6220	0.0042	-36.2718	-32.7285	-1.6824	-1.6348
0.6742	0.0629	200	0.6751	0.0003	-0.0454	0.6276	0.0458	-40.4596	-32.7631	-1.5097	-1.4586
0.6081	0.0944	300	0.5872	-0.5193	-0.8644	0.6619	0.3451	-122.3552	-84.7303	-0.4701	-0.3830
0.4463	0.1258	400	0.3978	-2.0312	-3.2212	0.7190	1.1900	-358.0407	-235.9217	-0.3673	-0.2101
0.3548	0.1573	500	0.3048	-2.5142	-4.1605	0.7698	1.6464	-451.9689	-284.2137	0.4417	0.6033
0.3014	0.1887	600	0.2395	-2.7662	-4.8033	0.7963	2.0371	-516.2451	-309.4138	1.0026	1.1670
0.25	0.2202	700	0.1989	-3.1039	-5.4194	0.8235	2.3155	-577.8538	-343.1828	1.3421	1.5051
0.2163	0.2517	800	0.1564	-3.4535	-6.3881	0.8369	2.9346	-674.7255	-378.1511	1.8084	1.9697
0.178	0.2831	900	0.1349	-3.4355	-6.5411	0.8586	3.1056	-690.0276	-376.3503	1.7688	1.9492
0.1736	0.3146	1000	0.1127	-3.5471	-6.9599	0.8668	3.4128	-731.9055	-387.5069	2.0848	2.2440
0.1474	0.3460	1100	0.0982	-3.6177	-7.2322	0.8799	3.6145	-759.1403	-394.5700	1.8280	2.0076
0.1382	0.3775	1200	0.0819	-4.3123	-8.3603	0.8862	4.0480	-871.9455	-464.0287	2.0966	2.2833
0.1133	0.4089	1300	0.0714	-4.0671	-8.3309	0.8955	4.2638	-869.0029	-439.5055	1.9082	2.1044
0.1209	0.4404	1400	0.0634	-4.8366	-9.4739	0.8933	4.6374	-983.3081	-516.4533	2.0574	2.2678
0.1057	0.4718	1500	0.0575	-4.1835	-8.8581	0.9019	4.6746	-921.7241	-451.1488	2.0907	2.2780
0.1057	0.5033	1600	0.0536	-4.2093	-8.9250	0.9131	4.7157	-928.4156	-453.7231	2.0198	2.2136
0.0881	0.5348	1700	0.0490	-4.4577	-9.3694	0.9101	4.9118	-972.8605	-478.5644	1.8760	2.0804
0.0847	0.5662	1800	0.0441	-4.2531	-9.4108	0.9131	5.1578	-977.0005	-458.1054	2.0999	2.2904
0.0713	0.5977	1900	0.0411	-4.4101	-9.6543	0.9168	5.2442	-1001.3448	-473.8065	2.0887	2.2861
0.0553	0.6291	2000	0.0378	-4.9687	-10.5782	0.9123	5.6095	-1093.7402	-529.6686	2.0469	2.2608
0.0668	0.6606	2100	0.0362	-4.7485	-10.3227	0.9190	5.5741	-1068.1823	-507.6488	2.1354	2.3368
0.0528	0.6920	2200	0.0356	-4.6766	-10.2170	0.9175	5.5404	-1057.6173	-500.4605	1.9572	2.1594
0.0596	0.7235	2300	0.0340	-4.6180	-10.2121	0.9235	5.5942	-1057.1299	-494.5929	2.0041	2.2117
0.063	0.7550	2400	0.0328	-4.5357	-10.1876	0.9257	5.6519	-1054.6713	-486.3653	2.1493	2.3488
0.0558	0.7864	2500	0.0311	-4.7155	-10.5680	0.9261	5.8526	-1092.7185	-504.3435	2.1208	2.3275
0.0552	0.8179	2600	0.0312	-4.6574	-10.3658	0.9254	5.7084	-1072.4943	-498.5399	2.0544	2.2592
0.066	0.8493	2700	0.0305	-4.6506	-10.4766	0.9287	5.8259	-1083.5740	-497.8611	2.0914	2.2968
0.0568	0.8808	2800	0.0302	-4.6423	-10.4629	0.9302	5.8206	-1082.2051	-497.0266	2.0957	2.3026
0.0602	0.9122	2900	0.0299	-4.6260	-10.4608	0.9299	5.8348	-1081.9958	-495.3989	2.0861	2.2911
0.0634	0.9437	3000	0.0298	-4.6454	-10.4843	0.9313	5.8389	-1084.3455	-497.3409	2.0655	2.2739
0.0602	0.9751	3100	0.0299	-4.6289	-10.4404	0.9302	5.8116	-1079.9603	-495.6860	2.0537	2.2623
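
As a quick sanity check on the table, the sketch below takes the last five evaluation rows, confirms that each reported margin matches chosen minus rejected up to rounding, and picks the step with the lowest validation loss. The tuple layout and variable names are illustrative; the numbers are copied from the rows above.

```python
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
# copied from the last five rows of the training results table.
rows = [
    (2700, 0.0305, -4.6506, -10.4766, 5.8259),
    (2800, 0.0302, -4.6423, -10.4629, 5.8206),
    (2900, 0.0299, -4.6260, -10.4608, 5.8348),
    (3000, 0.0298, -4.6454, -10.4843, 5.8389),
    (3100, 0.0299, -4.6289, -10.4404, 5.8116),
]

# Margins should equal chosen minus rejected, up to rounding in the last digit.
max_margin_error = max(
    abs((chosen - rejected) - margin)
    for _, _, chosen, rejected, margin in rows
)
print(f"largest margin discrepancy: {max_margin_error:.4f}")

# Step with the lowest validation loss among these rows.
best_step, best_loss = min(((r[0], r[1]) for r in rows), key=lambda x: x[1])
print(f"lowest validation loss {best_loss} at step {best_step}")
```

The validation loss is essentially flat over these rows, so the differences between the late checkpoints are small.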

Framework versions

PEFT 0.13.0
Transformers 4.45.1
Pytorch 2.4.1+cu121
Datasets 3.0.1
Tokenizers 0.20.0
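
Since PEFT is among the listed frameworks, the published weights are presumably a LoRA adapter and not a merged model. Under that assumption, a minimal loading sketch could look like the following. The repository ID is assembled from the owner and model name shown on this card and should be verified before use. The `load_model` helper is hypothetical, and whether the repository also ships tokenizer files is likewise an assumption.

```python
# Assumption: repository ID built from the owner and model name on this card.
ADAPTER_ID = "RedaAlami/zephyr-7b-dpo-qlora"


def load_model(adapter_id: str = ADAPTER_ID):
    """Hypothetical helper: load the base model with the adapter applied.

    Assumes the repository holds a PEFT adapter whose config points at the
    base model, and that tokenizer files are stored alongside it.
    """
    # Imported lazily so that defining the helper does not require the libraries.
    import torch
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoPeftModelForCausalLM.from_pretrained(
        adapter_id, torch_dtype=torch.bfloat16
    )
    tokenizer = AutoTokenizer.from_pretrained(adapter_id)
    return model, tokenizer
```

Calling `load_model()` downloads the weights, so it needs network access and enough memory for the base model.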

RedaAlami/zephyr-7b-dpo-qlora
