gpt1B_DPO_model

This model is a fine-tuned version of AI-Sweden-Models/gpt-sw3-1.3b, trained with Direct Preference Optimization (DPO) on an unspecified dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0123
  • Rewards/chosen: 0.0352
  • Rewards/rejected: -5.6889
  • Rewards/accuracies: 1.0
  • Rewards/margins: 5.7242
  • Logps/rejected: -278.6341
  • Logps/chosen: -126.7145
  • Logits/rejected: -2.7863
  • Logits/chosen: -2.9985
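The reward metrics above follow the DPO convention: Rewards/margins is the gap between the chosen and rejected rewards, and the DPO sigmoid loss shrinks as that margin grows. A minimal sketch of the per-pair relationship, plugged with the reported eval averages (the reported Loss of 0.0123 is a dataset-level average, so the single-pair value below will not match it exactly):

```python
import math

def dpo_pair_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Per-pair DPO sigmoid loss: -log(sigmoid(chosen - rejected))."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reported eval-set averages from the card
margin = 0.0352 - (-5.6889)          # ≈ 5.7241, matching Rewards/margins up to rounding
loss = dpo_pair_loss(0.0352, -5.6889)
print(round(margin, 4), round(loss, 4))
```

A margin near 5.7 already drives the per-pair loss close to zero, which is consistent with the very small validation losses in the table below.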

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
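The card does not state the training code. The metric names (rewards/chosen, rewards/rejected, logps/…) match TRL's DPOTrainer logging, and the framework list below includes PEFT, so a plausible reconstruction of the setup with the listed hyperparameters is sketched here as a config fragment; `beta`, the LoRA settings, and the dataset are assumptions, not taken from the card:

```python
# Hedged sketch, not the author's actual script. Assumes TRL's DPOTrainer
# (circa trl for Transformers 4.38) with a PEFT/LoRA adapter; beta and the
# preference dataset are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("AI-Sweden-Models/gpt-sw3-1.3b")
tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/gpt-sw3-1.3b")

args = TrainingArguments(
    output_dir="gpt1B_DPO_model",
    learning_rate=5e-6,
    per_device_train_batch_size=1,   # train_batch_size
    per_device_eval_batch_size=1,    # eval_batch_size
    gradient_accumulation_steps=8,   # total_train_batch_size = 1 * 8 = 8
    num_train_epochs=3,
    lr_scheduler_type="linear",
    seed=42,
)

trainer = DPOTrainer(
    model,
    ref_model=None,                  # with a peft_config, TRL uses the frozen base model as reference
    args=args,
    beta=0.1,                        # assumed; the card does not report beta
    train_dataset=preference_dataset,  # prompt/chosen/rejected columns; dataset not specified in the card
    tokenizer=tokenizer,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # LoRA hyperparameters not reported
)
trainer.train()
```

Note that total_train_batch_size is derived, not set directly: one example per device times 8 gradient-accumulation steps gives 8 examples per optimizer update.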

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.2383 | 0.2 | 50 | 0.2344 | 0.1296 | -1.3092 | 0.9967 | 1.4389 | -234.8370 | -125.7705 | -3.0903 | -3.2537 |
| 0.0573 | 0.4 | 100 | 0.0615 | 0.1058 | -3.2004 | 0.9967 | 3.3063 | -253.7490 | -126.0084 | -2.9086 | -3.0985 |
| 0.0262 | 0.6 | 150 | 0.0291 | -0.0050 | -4.5248 | 0.9967 | 4.5198 | -266.9924 | -127.1163 | -2.8221 | -3.0267 |
| 0.0191 | 0.79 | 200 | 0.0205 | 0.0107 | -4.9990 | 0.9967 | 5.0096 | -271.7344 | -126.9600 | -2.8042 | -3.0131 |
| 0.0106 | 0.99 | 250 | 0.0171 | -0.0051 | -5.3187 | 0.9967 | 5.3135 | -274.9313 | -127.1180 | -2.7884 | -3.0001 |
| 0.0129 | 1.19 | 300 | 0.0148 | 0.0024 | -5.4879 | 1.0 | 5.4902 | -276.6234 | -127.0432 | -2.7840 | -2.9962 |
| 0.0125 | 1.39 | 350 | 0.0137 | 0.0243 | -5.5389 | 1.0 | 5.5632 | -277.1337 | -126.8233 | -2.7873 | -2.9994 |
| 0.0079 | 1.59 | 400 | 0.0129 | 0.0313 | -5.5885 | 1.0 | 5.6198 | -277.6297 | -126.7539 | -2.7878 | -3.0000 |
| 0.0077 | 1.79 | 450 | 0.0126 | 0.0332 | -5.6246 | 1.0 | 5.6578 | -277.9906 | -126.7342 | -2.7878 | -2.9998 |
| 0.0073 | 1.99 | 500 | 0.0126 | 0.0322 | -5.6582 | 1.0 | 5.6905 | -278.3270 | -126.7444 | -2.7863 | -2.9985 |
| 0.0087 | 2.19 | 550 | 0.0123 | 0.0334 | -5.6819 | 1.0 | 5.7153 | -278.5634 | -126.7327 | -2.7862 | -2.9983 |
| 0.0111 | 2.38 | 600 | 0.0123 | 0.0324 | -5.6898 | 1.0 | 5.7222 | -278.6425 | -126.7427 | -2.7862 | -2.9984 |
| 0.0086 | 2.58 | 650 | 0.0122 | 0.0357 | -5.6877 | 1.0 | 5.7234 | -278.6218 | -126.7101 | -2.7863 | -2.9984 |
| 0.0067 | 2.78 | 700 | 0.0122 | 0.0352 | -5.6897 | 1.0 | 5.7249 | -278.6414 | -126.7143 | -2.7860 | -2.9981 |
| 0.0067 | 2.98 | 750 | 0.0123 | 0.0352 | -5.6889 | 1.0 | 5.7242 | -278.6341 | -126.7145 | -2.7863 | -2.9985 |
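The Logps columns are the policy model's summed log-probabilities of the chosen and rejected completions. In DPO, each reward is the β-scaled difference between policy and reference log-probabilities. A minimal sketch of that relationship; the card reports neither the reference log-probs nor β, so the reference values below are back-solved from the final eval row under an assumed β = 0.1, purely for illustration:

```python
def dpo_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)

# Policy logps from the final eval row; reference logps are hypothetical,
# chosen so the example reproduces the reported rewards under beta = 0.1.
r_chosen = dpo_reward(logp_policy=-126.7145, logp_ref=-127.0665)
r_rejected = dpo_reward(logp_policy=-278.6341, logp_ref=-221.7451)
print(round(r_chosen, 4), round(r_rejected, 4))
```

Under these assumptions the chosen completion is slightly more likely under the policy than under the reference, while the rejected completion has been pushed far below its reference likelihood, which is the expected DPO training signature.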

Framework versions

  • PEFT 0.8.2
  • Transformers 4.38.1
  • Pytorch 2.2.0+cu118
  • Datasets 2.17.1
  • Tokenizers 0.15.2