
llama-7b-SFT-qlora-wiki_DPO_ds_RM_random_1024_r_64_alpha_16

This model was fine-tuned with DPO from dhmeltzer/llama-7b-SFT_ds_wiki65k_1024_r_64_alpha_16_merged; the preference dataset used is not documented here. It achieves the following results on the evaluation set (a loading sketch follows the metrics):

  • Loss: 0.6801
  • Rewards/chosen: -0.1790
  • Rewards/rejected: -0.2369
  • Rewards/accuracies: 0.5469
  • Rewards/margins: 0.0578
  • Logps/rejected: -206.1121
  • Logps/chosen: -202.9860
  • Logits/rejected: 1.1465
  • Logits/chosen: 1.1674
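
The repository name indicates a QLoRA adapter (LoRA rank 64, alpha 16) trained with DPO on top of the merged SFT base. Below is a minimal loading sketch, assuming this repository holds a PEFT adapter; the repo IDs are taken from this card and the prompt is purely illustrative.

```python
# Minimal loading sketch (assumes this repo contains a PEFT/QLoRA adapter;
# skip the PeftModel step if the weights turn out to be merged).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "dhmeltzer/llama-7b-SFT_ds_wiki65k_1024_r_64_alpha_16_merged"
adapter_id = "dhmeltzer/llama-7b-SFT-qlora-wiki_DPO_ds_RM_random_1024_r_64_alpha_16"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto", torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_id)

prompt = "Summarize what a wiki is in one sentence."  # illustrative only
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```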

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a matching TrainingArguments sketch follows the list):

  • learning_rate: 0.0002
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 128
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.03
  • num_epochs: 1
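
For reference, here is a hedged reconstruction of these settings as a transformers.TrainingArguments object. The output_dir is a placeholder, optim="adamw_torch" is an assumption, and the Adam betas/epsilon listed above are the library defaults.

```python
# Sketch only: mirrors the hyperparameters listed above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-7b-SFT-qlora-wiki_DPO",  # placeholder, not from this card
    learning_rate=2e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=4,             # 32 x 4 = 128 effective train batch size
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    seed=42,
    optim="adamw_torch",                       # assumption; betas=(0.9, 0.999), eps=1e-8 are the defaults
)
```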

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6904 | 0.1 | 19 | 0.6904 | -0.3143 | -0.3636 | 0.5458 | 0.0493 | -207.3793 | -204.3384 | 1.1224 | 1.1416 |
| 0.6725 | 0.21 | 38 | 0.6850 | -0.3901 | -0.4540 | 0.5547 | 0.0640 | -208.2836 | -205.0964 | 1.1270 | 1.1469 |
| 0.6818 | 0.31 | 57 | 0.6801 | -0.1790 | -0.2369 | 0.5469 | 0.0578 | -206.1121 | -202.9860 | 1.1465 | 1.1674 |
| 0.6671 | 0.41 | 76 | 0.6863 | -0.2598 | -0.3469 | 0.5580 | 0.0871 | -207.2126 | -203.7936 | 1.1468 | 1.1665 |
| 0.6683 | 0.52 | 95 | 0.6841 | -0.1475 | -0.2325 | 0.5502 | 0.0851 | -206.0687 | -202.6704 | 1.1388 | 1.1590 |
| 0.6626 | 0.62 | 114 | 0.6846 | -0.0836 | -0.1600 | 0.5480 | 0.0764 | -205.3429 | -202.0314 | 1.1263 | 1.1474 |
| 0.6593 | 0.72 | 133 | 0.6864 | -0.1272 | -0.2184 | 0.5625 | 0.0912 | -205.9276 | -202.4675 | 1.1106 | 1.1306 |
| 0.672 | 0.83 | 152 | 0.6857 | -0.1452 | -0.2334 | 0.5592 | 0.0882 | -206.0777 | -202.6477 | 1.1086 | 1.1293 |
| 0.6671 | 0.93 | 171 | 0.6855 | -0.1472 | -0.2350 | 0.5547 | 0.0878 | -206.0934 | -202.6673 | 1.1071 | 1.1270 |
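
The reward columns above follow the standard DPO convention: each example's reward is beta times the log-probability ratio between the policy and the frozen reference (SFT) model, the margin is the chosen reward minus the rejected reward, and accuracy is the fraction of pairs where the chosen reward is higher. A sketch of that computation (beta is an assumption; its value is not reported in this card):

```python
# Illustrative DPO metric computation; inputs are per-example summed
# log-probabilities of the chosen/rejected responses under the policy
# and the frozen reference model.
import torch
import torch.nn.functional as F

beta = 0.1  # assumed DPO temperature, not reported in this card

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps):
    rewards_chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rewards_rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = rewards_chosen - rewards_rejected
    loss = -F.logsigmoid(margins).mean()          # DPO loss
    accuracy = (rewards_chosen > rewards_rejected).float().mean()
    return {
        "loss": loss,
        "rewards/chosen": rewards_chosen.mean(),
        "rewards/rejected": rewards_rejected.mean(),
        "rewards/margins": margins.mean(),
        "rewards/accuracies": accuracy,
    }
```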

Framework versions

  • Transformers 4.32.1
  • Pytorch 2.0.1+cu118
  • Datasets 2.14.4
  • Tokenizers 0.13.3