
doplhin-dpo

This model is a DPO fine-tuned PEFT adapter for cognitivecomputations/dolphin-2.1-mistral-7b, trained on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 1.0809
  • Rewards/chosen: -22.5048
  • Rewards/rejected: -33.0285
  • Rewards/accuracies: 0.8076
  • Rewards/margins: 10.5237
  • Logps/rejected: -629.7220
  • Logps/chosen: -567.8747
  • Logits/rejected: -2.5481
  • Logits/chosen: -2.5972

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 1
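The cosine schedule with a 0.1 warmup ratio can be sketched as below. The total step count is hypothetical (the card does not state it); only `base_lr` and `warmup_ratio` come from the hyperparameters above.

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-5, warmup_ratio=0.1):
    """Linear warmup for the first warmup_ratio of steps, then cosine decay.
    total_steps is a hypothetical value; the card does not report it."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At step 0 the learning rate is 0, it peaks at 5e-05 when warmup ends, and it decays to 0 by the final step.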

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.7986 | 0.13 | 700 | 2.1962 | -19.2854 | -23.2541 | 0.6680 | 3.9687 | -531.9778 | -535.6807 | -2.6393 | -2.7332 |
| 1.3794 | 0.25 | 1400 | 1.5931 | -24.2833 | -32.1549 | 0.7393 | 7.8716 | -620.9865 | -585.6600 | -2.4941 | -2.6078 |
| 1.7768 | 0.38 | 2100 | 1.2640 | -24.9513 | -33.2837 | 0.7618 | 8.3324 | -632.2739 | -592.3398 | -1.5676 | -1.9552 |
| 1.0764 | 0.51 | 2800 | 1.1802 | -24.8340 | -32.7263 | 0.7807 | 7.8923 | -626.7006 | -591.1669 | -2.2188 | -2.3807 |
| 1.1698 | 0.64 | 3500 | 1.1290 | -17.1234 | -26.7346 | 0.7982 | 9.6112 | -566.7830 | -514.0612 | -2.6586 | -2.7169 |
| 1.1884 | 0.76 | 4200 | 1.0909 | -23.1635 | -33.5559 | 0.8044 | 10.3924 | -634.9959 | -574.4622 | -2.5670 | -2.6170 |
| 0.6424 | 0.89 | 4900 | 1.0809 | -22.5048 | -33.0285 | 0.8076 | 10.5237 | -629.7220 | -567.8747 | -2.5481 | -2.5972 |
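A sketch of how the DPO metrics in the table relate. In DPO, Rewards/chosen and Rewards/rejected are beta-scaled log-probability ratios against the reference model, and Rewards/margins is their difference; the per-pair loss is the negative log-sigmoid of the margin. The values below are the final-eval numbers from the last row.

```python
import math

# Final-eval reward values from the table above
rewards_chosen = -22.5048
rewards_rejected = -33.0285

# Rewards/margins column is simply the difference
margin = rewards_chosen - rewards_rejected  # 10.5237

# Per-pair DPO loss: -log sigmoid(margin)
per_pair_loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A margin above 10 drives the per-pair loss close to zero; the reported validation loss of 1.0809 is an average over all evaluation pairs, many of which have much smaller or negative margins.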

Framework versions

  • PEFT 0.8.2
  • Transformers 4.37.2
  • Pytorch 2.2.2+cu121
  • Datasets 2.16.1
  • Tokenizers 0.15.2
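A hypothetical usage sketch for loading this PEFT adapter on top of the base model with the library versions above. The repository ids and fp16 dtype follow the card; this has not been verified against the checkpoint.

```python
# Hypothetical loading sketch; downloads the 7B base model and the adapter.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/dolphin-2.1-mistral-7b",
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base, "Liu-Xiang/doplhin-dpo")
tokenizer = AutoTokenizer.from_pretrained(
    "cognitivecomputations/dolphin-2.1-mistral-7b"
)
```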