
zephyr-7b-dpo-qlora

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.4880
  • Rewards/chosen: -2.8615
  • Rewards/rejected: -3.9313
  • Rewards/accuracies: 0.7262
  • Rewards/margins: 1.0698
  • Logps/rejected: -626.2534
  • Logps/chosen: -549.3907
  • Logits/rejected: 1.3412
  • Logits/chosen: 0.7713
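This repository holds a PEFT (QLoRA) adapter rather than merged weights, so it can be loaded on top of its base model with the peft library. The snippet below is a minimal sketch, not part of the original card: it assumes the adapter repo id is chanchan7/zephyr-7b-dpo-qlora, lets AutoPeftModelForCausalLM resolve the base model from the adapter config, and uses illustrative generation settings. If the adapter repo does not ship tokenizer files, load the tokenizer from the base model instead.

```python
# Minimal sketch: load the DPO QLoRA adapter and generate a reply.
# The adapter repo id below is an assumption taken from this card's page.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "chanchan7/zephyr-7b-dpo-qlora"

# Base model weights are resolved automatically from the adapter config.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Print only the newly generated tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```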

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 1
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 3
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 12
  • total_eval_batch_size: 24
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
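The total batch sizes above follow directly from the per-device settings; a quick check of the arithmetic (all values copied from the list, nothing assumed):

```python
# Effective batch sizes implied by the per-device hyperparameters above.
train_batch_size = 1              # per device
eval_batch_size = 8               # per device
num_devices = 3
gradient_accumulation_steps = 4

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

assert total_train_batch_size == 12   # matches total_train_batch_size
assert total_eval_batch_size == 24    # matches total_eval_batch_size
```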

Training results

Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen
0.6884 0.02 100 0.6868 0.0390 0.0284 0.6146 0.0106 -230.2779 -259.3362 -2.3476 -2.3366
0.6654 0.04 200 0.6657 0.0334 -0.0194 0.6399 0.0528 -235.0622 -259.9052 -2.2635 -2.2585
0.6346 0.06 300 0.6431 -0.2564 -0.3692 0.6533 0.1128 -270.0399 -288.8787 -2.2107 -2.2217
0.5888 0.08 400 0.6162 -0.4195 -0.6312 0.6518 0.2118 -296.2420 -305.1884 -1.9579 -1.9905
0.5806 0.1 500 0.5916 -1.3171 -1.6507 0.6637 0.3337 -398.1920 -394.9468 -0.4990 -0.5253
0.6219 0.12 600 0.5753 -1.1344 -1.5063 0.6503 0.3719 -383.7478 -376.6808 0.0384 -0.0361
0.5586 0.14 700 0.5733 -0.7892 -1.1878 0.6667 0.3986 -351.8957 -342.1609 0.3073 0.2473
0.6123 0.16 800 0.5578 -1.2731 -1.7042 0.6652 0.4311 -403.5397 -390.5542 1.0809 1.0327
0.555 0.18 900 0.5461 -1.1941 -1.8087 0.6771 0.6146 -413.9875 -382.6491 1.4158 1.3993
0.4905 0.2 1000 0.5463 -1.2469 -1.9528 0.6890 0.7058 -428.3945 -387.9334 0.8211 0.7732
0.5214 0.22 1100 0.5356 -1.2786 -1.8992 0.6979 0.6206 -423.0347 -391.1008 1.3945 1.4163
0.4988 0.24 1200 0.5307 -1.2179 -1.9293 0.6979 0.7115 -426.0503 -385.0261 1.0273 0.9228
0.5324 0.26 1300 0.5320 -1.4512 -2.1779 0.7024 0.7267 -450.9060 -408.3595 0.9344 0.5917
0.5286 0.27 1400 0.5193 -1.3777 -2.1412 0.7039 0.7634 -447.2371 -401.0145 1.1979 0.8244
0.6095 0.29 1500 0.5206 -1.1730 -1.8883 0.7009 0.7153 -421.9497 -380.5422 0.3598 -0.0238
0.5627 0.31 1600 0.5225 -1.8811 -2.7733 0.6935 0.8922 -510.4463 -451.3462 0.7395 0.4147
0.5222 0.33 1700 0.5210 -1.1883 -1.8477 0.7143 0.6593 -417.8853 -382.0739 -0.0643 -0.3844
0.5163 0.35 1800 0.5219 -1.1780 -1.9783 0.7247 0.8003 -430.9522 -381.0428 1.3000 0.9605
0.511 0.37 1900 0.5214 -1.8532 -2.7395 0.7188 0.8863 -507.0662 -448.5622 1.3052 0.9550
0.484 0.39 2000 0.5161 -1.7800 -2.6182 0.7188 0.8382 -494.9370 -441.2427 1.6339 1.3132
0.4863 0.41 2100 0.5183 -2.7826 -3.8427 0.7158 1.0600 -617.3857 -541.5035 2.3428 2.0461
0.5233 0.43 2200 0.5115 -1.7702 -2.6185 0.7173 0.8483 -494.9643 -440.2580 0.9791 0.5628
0.5343 0.45 2300 0.5079 -1.4313 -2.2210 0.7202 0.7897 -455.2213 -406.3701 1.0255 0.5469
0.5251 0.47 2400 0.5088 -2.7117 -3.7995 0.7173 1.0878 -613.0708 -534.4126 2.1153 1.5133
0.5104 0.49 2500 0.5006 -2.9970 -4.0022 0.7202 1.0052 -633.3362 -562.9377 2.2889 1.7461
0.429 0.51 2600 0.5238 -3.6282 -4.8032 0.7143 1.1750 -713.4386 -626.0600 3.6631 3.2827
0.4255 0.53 2700 0.4993 -2.4946 -3.5067 0.7188 1.0121 -583.7889 -512.7010 2.1920 1.6873
0.4733 0.55 2800 0.4990 -3.2116 -4.2800 0.7202 1.0684 -661.1174 -584.3987 2.6796 2.2111
0.5394 0.57 2900 0.5040 -2.9132 -3.9276 0.7158 1.0143 -625.8766 -554.5653 1.7758 1.2351
0.5128 0.59 3000 0.5061 -2.5974 -3.5725 0.7173 0.9750 -590.3638 -522.9818 2.1284 1.6663
0.5215 0.61 3100 0.4960 -2.2632 -3.1876 0.7188 0.9245 -551.8787 -489.5560 1.4432 0.8594
0.5023 0.63 3200 0.4999 -2.8630 -3.9641 0.7128 1.1011 -629.5237 -549.5392 1.9057 1.2951
0.5042 0.65 3300 0.4904 -2.8448 -3.8793 0.7307 1.0345 -621.0500 -547.7245 1.9776 1.4334
0.498 0.67 3400 0.4879 -2.8423 -3.8097 0.7321 0.9673 -614.0843 -547.4754 1.4781 0.9608
0.4987 0.69 3500 0.4902 -2.6926 -3.7172 0.7307 1.0246 -604.8372 -532.4977 1.3819 0.8557
0.5824 0.71 3600 0.4908 -2.5673 -3.5933 0.7292 1.0260 -592.4445 -519.9661 1.1037 0.5336
0.425 0.73 3700 0.4906 -2.7666 -3.8246 0.7307 1.0580 -615.5826 -539.9020 1.2903 0.7257
0.4756 0.75 3800 0.4916 -2.8732 -3.9598 0.7292 1.0866 -629.0961 -550.5607 1.5015 0.9387
0.4597 0.77 3900 0.4896 -2.8617 -3.9425 0.7277 1.0808 -627.3712 -549.4086 1.3350 0.7636
0.4649 0.79 4000 0.4885 -2.8682 -3.9370 0.7232 1.0688 -626.8230 -550.0615 1.2903 0.7213
0.4689 0.8 4100 0.4880 -2.8425 -3.9060 0.7232 1.0634 -623.7166 -547.4950 1.2495 0.6763
0.4275 0.82 4200 0.4877 -2.8671 -3.9353 0.7232 1.0682 -626.6478 -549.9532 1.3067 0.7331
0.5325 0.84 4300 0.4881 -2.8855 -3.9630 0.7262 1.0775 -629.4202 -551.7905 1.3795 0.8070
0.532 0.86 4400 0.4881 -2.8672 -3.9406 0.7277 1.0734 -627.1785 -549.9610 1.3435 0.7732
0.4558 0.88 4500 0.4879 -2.8560 -3.9259 0.7262 1.0699 -625.7067 -548.8392 1.3411 0.7711
0.5541 0.9 4600 0.4882 -2.8601 -3.9295 0.7262 1.0694 -626.0704 -549.2481 1.3428 0.7729
0.5743 0.92 4700 0.4879 -2.8641 -3.9344 0.7262 1.0702 -626.5551 -549.6526 1.3445 0.7755
0.4657 0.94 4800 0.4880 -2.8626 -3.9322 0.7292 1.0696 -626.3386 -549.4993 1.3437 0.7749
0.5126 0.96 4900 0.4880 -2.8636 -3.9339 0.7277 1.0703 -626.5126 -549.6042 1.3440 0.7748
0.3967 0.98 5000 0.4880 -2.8643 -3.9344 0.7262 1.0702 -626.5614 -549.6658 1.3424 0.7736

Framework versions

  • PEFT 0.7.1
  • Transformers 4.36.2
  • Pytorch 2.2.1+cu121
  • Datasets 2.14.6
  • Tokenizers 0.15.2
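To check whether a local environment matches the versions listed above, something like the following can be used (a convenience sketch, not part of the original card):

```python
# Compare locally installed package versions against the versions this card lists.
from importlib.metadata import version, PackageNotFoundError

expected = {
    "peft": "0.7.1",
    "transformers": "4.36.2",
    "torch": "2.2.1",      # card lists 2.2.1+cu121; the local build tag may differ
    "datasets": "2.14.6",
    "tokenizers": "0.15.2",
}

for package, wanted in expected.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        installed = "not installed"
    marker = "OK" if installed.startswith(wanted) else "MISMATCH"
    print(f"{package}: installed {installed}, card lists {wanted} -> {marker}")
```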