llama3_8b_instruct_dpo_bwgenerator
This model is a fine-tuned version of NanQiangHF/llama3_8b_instruct_bwgenerator, trained with DPO on an undocumented preference dataset. It achieves the following results on the evaluation set (a usage sketch follows the list):
- Loss: 0.0706
- Rewards/chosen: -4.6241
- Rewards/rejected: -14.8342
- Rewards/accuracies: 0.9780
- Rewards/margins: 10.2101
- Logps/rejected: -216.1456
- Logps/chosen: -84.8191
- Logits/rejected: 0.9202
- Logits/chosen: 0.3552
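Rewards/margins is the gap between the chosen and rejected rewards; here 10.2101 = -4.6241 - (-14.8342). Since PEFT is listed in the framework versions below, this repository presumably holds an adapter that must be loaded on top of the base model for inference. A minimal sketch, assuming a standard PEFT setup (the dtype, device map, and prompt are placeholders, not part of this card):

```python
# Minimal inference sketch, assuming the DPO checkpoint is a PEFT adapter
# for the base model; dtype/device settings here are assumptions.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "NanQiangHF/llama3_8b_instruct_bwgenerator"
adapter_id = "NanQiangHF/llama3_8b_instruct_dpo_bwgenerator"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)

prompt = "..."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```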
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training (an equivalent DPOTrainer configuration is sketched after this list):
- learning_rate: 5e-06
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
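The card does not name the training framework beyond PEFT and Transformers, so the exact training code is unknown; the sketch below shows how these hyperparameters would map onto TRL's DPOTrainer under that assumption. The Adam betas and epsilon above are the Transformers defaults, and both the beta value and the preference dataset below are placeholders:

```python
# Hypothetical reconstruction assuming TRL's DPOTrainer; TRL is not listed
# in the framework versions, and the DPO beta is not reported in this card.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "NanQiangHF/llama3_8b_instruct_bwgenerator"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Placeholder rows: the actual preference dataset is undocumented.
train_dataset = Dataset.from_dict({
    "prompt": ["..."],
    "chosen": ["..."],
    "rejected": ["..."],
})

args = DPOConfig(
    output_dir="llama3_8b_instruct_dpo_bwgenerator",
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    beta=0.1,  # assumption: TRL's default; the card does not report beta
    # Adam betas=(0.9, 0.999) and epsilon=1e-8 are the default optimizer
    # settings, matching the values listed above.
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # ref_model defaults to a frozen copy of `model`
)
trainer.train()
```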
Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.247 | 0.0719 | 1000 | 0.0906 | -3.7216 | -11.8877 | 0.9686 | 8.1662 | -186.6814 | -75.7941 | 0.8504 | 0.3080 |
| 0.083 | 0.1438 | 2000 | 0.0775 | -4.5564 | -14.1375 | 0.9764 | 9.5811 | -209.1791 | -84.1423 | 0.8989 | 0.3418 |
| 0.0623 | 0.2157 | 3000 | 0.0734 | -4.5379 | -14.4993 | 0.9770 | 9.9614 | -212.7973 | -83.9572 | 0.9082 | 0.3471 |
| 0.069 | 0.2876 | 4000 | 0.0713 | -4.5601 | -14.6450 | 0.9777 | 10.0850 | -214.2546 | -84.1790 | 0.9145 | 0.3514 |
| 0.0752 | 0.3595 | 5000 | 0.0706 | -4.4918 | -14.6244 | 0.9793 | 10.1326 | -214.0477 | -83.4960 | 0.9181 | 0.3533 |
| 0.0723 | 0.4313 | 6000 | 0.0710 | -4.6381 | -14.8167 | 0.9780 | 10.1787 | -215.9714 | -84.9590 | 0.9187 | 0.3542 |
| 0.0852 | 0.5032 | 7000 | 0.0705 | -4.6251 | -14.8143 | 0.9783 | 10.1893 | -215.9474 | -84.8290 | 0.9189 | 0.3542 |
| 0.0811 | 0.5751 | 8000 | 0.0706 | -4.6409 | -14.8406 | 0.9780 | 10.1997 | -216.2102 | -84.9870 | 0.9185 | 0.3538 |
| 0.0762 | 0.6470 | 9000 | 0.0699 | -4.6161 | -14.8083 | 0.9790 | 10.1921 | -215.8869 | -84.7398 | 0.9186 | 0.3541 |
| 0.0686 | 0.7189 | 10000 | 0.0703 | -4.6164 | -14.8042 | 0.9790 | 10.1878 | -215.8462 | -84.7421 | 0.9185 | 0.3537 |
| 0.061 | 0.7908 | 11000 | 0.0705 | -4.6191 | -14.8169 | 0.9793 | 10.1977 | -215.9726 | -84.7695 | 0.9207 | 0.3556 |
| 0.0786 | 0.8627 | 12000 | 0.0698 | -4.6080 | -14.7978 | 0.9793 | 10.1898 | -215.7822 | -84.6584 | 0.9195 | 0.3546 |
| 0.073 | 0.9346 | 13000 | 0.0706 | -4.6241 | -14.8342 | 0.9780 | 10.2101 | -216.1456 | -84.8191 | 0.9202 | 0.3552 |
Framework versions
- PEFT 0.10.0
- Transformers 4.44.0
- PyTorch 2.3.0+cu121
- Datasets 2.14.7
- Tokenizers 0.19.1
Base model
- NanQiangHF/llama3_8b_instruct_bwgenerator