zephyr-7b-gpo-v0-i1

This model is a fine-tuned version of DUAL-GPO/zephyr-7b-gpo-update3-i0 on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.1128
Rewards/chosen: -0.3200
Rewards/rejected: -0.3706
Rewards/accuracies: 0.4955
Rewards/margins: 0.0506
Logps/rejected: -621.5818
Logps/chosen: -585.8446
Logits/rejected: -1.9142
Logits/chosen: -2.0965
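
The reward margin listed above can be cross-checked against the two reward values. The sketch below assumes the usual preference-training definition (margin = chosen reward minus rejected reward), which this card does not state explicitly, and uses only the numbers reported above.

```python
# Evaluation-set values copied from the list above.
rewards_chosen = -0.3200
rewards_rejected = -0.3706
reported_margin = 0.0506

# Assumed definition: margin = chosen reward - rejected reward.
margin = rewards_chosen - rewards_rejected

# Compare with a tolerance, since the card rounds to four decimals.
assert abs(margin - reported_margin) < 1e-4
print(f"margin = {margin:.4f}")
```

Under the same assumption, a Rewards/accuracies value of 0.4955 would mean the chosen response receives the higher reward in roughly half of the evaluation pairs.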

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 2
eval_batch_size: 2
seed: 42
distributed_type: multi-GPU
num_devices: 3
gradient_accumulation_steps: 2
total_train_batch_size: 12
total_eval_batch_size: 6
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
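
The totals in the list follow from the per-device values, and the learning-rate settings describe a linear warmup followed by cosine decay. The sketch below reproduces both in plain Python. The schedule formula is the common warmup-plus-half-cosine form and is an assumption about what the training run used; `cosine_lr` and `TOTAL_STEPS` are hypothetical names and values chosen for illustration, since the card does not report the total number of optimizer steps.

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 2            # per device
eval_batch_size = 2             # per device
num_devices = 3
gradient_accumulation_steps = 2
learning_rate = 5e-06
warmup_ratio = 0.1

# Effective batch sizes; these match the totals reported in the list.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices


def cosine_lr(step: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical step count, used only to illustrate the shape of the schedule.
TOTAL_STEPS = 1000

print("total_train_batch_size:", total_train_batch_size)
print("total_eval_batch_size:", total_eval_batch_size)
for step in (0, 50, 100, 550, 1000):
    print(step, f"{cosine_lr(step, TOTAL_STEPS):.2e}")
```

With a warmup ratio of 0.1, the peak learning rate of 5e-06 is reached after the first tenth of training and then decays toward zero by the final step.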

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.3416	0.02	100	0.0447	-0.0994	-0.1161	0.5883	0.0167	-367.1221	-365.3260	-1.7202	-1.8827
0.2571	0.05	200	0.0858	-0.1849	-0.2159	0.4790	0.0310	-466.8627	-450.7509	-1.8599	-2.0364
0.2771	0.07	300	0.0910	-0.2419	-0.2769	0.4775	0.0350	-527.8735	-507.7906	-1.9087	-2.0909
0.2561	0.1	400	0.1127	-0.4661	-0.5086	0.4895	0.0425	-759.5652	-731.9658	-1.9571	-2.1511
0.2604	0.12	500	0.0826	-0.3221	-0.3613	0.4835	0.0393	-612.2919	-587.9281	-1.8643	-2.0449
0.2778	0.14	600	0.1033	-0.2940	-0.3303	0.4760	0.0363	-581.3212	-559.9218	-1.8588	-2.0387
0.2631	0.17	700	0.1084	-0.3587	-0.4024	0.4865	0.0437	-653.3798	-624.5897	-1.8458	-2.0252
0.2264	0.19	800	0.1158	-0.2355	-0.2734	0.4731	0.0378	-524.3303	-501.3899	-1.8726	-2.0501
0.2593	0.22	900	0.1048	-0.2730	-0.3214	0.4865	0.0485	-572.4186	-538.8648	-1.7883	-1.9593
0.2248	0.24	1000	0.1122	-0.2753	-0.3216	0.4760	0.0463	-572.5806	-541.1548	-1.8308	-2.0088
0.2345	0.26	1100	0.1249	-0.2594	-0.2977	0.4581	0.0382	-548.6310	-525.3046	-1.8628	-2.0406
0.2	0.29	1200	0.1212	-0.3796	-0.4250	0.4925	0.0454	-675.9450	-645.4562	-1.8382	-2.0177
0.2246	0.31	1300	0.1102	-0.2548	-0.3030	0.4850	0.0482	-553.9783	-520.6531	-1.9584	-2.1449
0.2481	0.34	1400	0.1082	-0.2988	-0.3545	0.4955	0.0557	-605.4994	-564.6545	-1.8877	-2.0708
0.232	0.36	1500	0.1053	-0.2421	-0.2907	0.4910	0.0486	-541.7161	-508.0170	-1.9404	-2.1256
0.2351	0.38	1600	0.1098	-0.3383	-0.3864	0.4775	0.0481	-637.3510	-604.1564	-1.8506	-2.0290
0.2622	0.41	1700	0.1196	-0.2614	-0.3121	0.4820	0.0507	-563.0452	-527.2568	-1.9197	-2.1016
0.2043	0.43	1800	0.1257	-0.2798	-0.3252	0.4820	0.0454	-576.1965	-545.7018	-1.9177	-2.0980
0.2205	0.46	1900	0.1154	-0.4037	-0.4629	0.4850	0.0592	-713.9170	-669.5957	-1.8198	-1.9972
0.2156	0.48	2000	0.1103	-0.2727	-0.3161	0.4865	0.0434	-567.0794	-538.5911	-1.9234	-2.1044
0.2308	0.5	2100	0.1163	-0.4322	-0.4852	0.4925	0.0531	-736.1898	-698.0287	-1.8013	-1.9761
0.2204	0.53	2200	0.1083	-0.3224	-0.3712	0.4940	0.0488	-622.1750	-588.3229	-1.8487	-2.0260
0.2303	0.55	2300	0.1192	-0.3117	-0.3667	0.4940	0.0551	-617.7075	-577.5367	-1.8679	-2.0473
0.231	0.58	2400	0.1068	-0.3476	-0.4008	0.5	0.0532	-651.7600	-613.4935	-1.8167	-1.9926
0.2252	0.6	2500	0.1240	-0.3568	-0.4154	0.4940	0.0586	-666.3873	-622.7224	-1.9124	-2.0972
0.2445	0.62	2600	0.1240	-0.3426	-0.4003	0.4805	0.0576	-651.2365	-608.5200	-1.9230	-2.1073
0.2212	0.65	2700	0.1103	-0.2894	-0.3362	0.4925	0.0468	-587.1506	-555.2968	-1.9049	-2.0860
0.2301	0.67	2800	0.1073	-0.2754	-0.3278	0.5105	0.0524	-578.7745	-541.2313	-1.9024	-2.0838
0.2099	0.7	2900	0.1191	-0.3108	-0.3657	0.5015	0.0549	-616.7156	-576.6858	-1.9182	-2.1014
0.2072	0.72	3000	0.1120	-0.3062	-0.3563	0.4910	0.0500	-607.2319	-572.1099	-1.9258	-2.1090
0.2186	0.74	3100	0.1155	-0.2960	-0.3474	0.4985	0.0514	-598.4005	-561.9234	-1.9031	-2.0849
0.2743	0.77	3200	0.1121	-0.2815	-0.3314	0.4955	0.0499	-582.3980	-547.4086	-1.9332	-2.1170
0.1989	0.79	3300	0.1116	-0.3235	-0.3744	0.4850	0.0509	-625.3889	-589.4213	-1.8977	-2.0789
0.2258	0.82	3400	0.1093	-0.3091	-0.3603	0.4970	0.0512	-611.2418	-574.9766	-1.9164	-2.0989
0.2524	0.84	3500	0.1142	-0.3383	-0.3897	0.4910	0.0514	-640.6893	-604.2028	-1.9130	-2.0956
0.2202	0.86	3600	0.1173	-0.3412	-0.3925	0.4835	0.0513	-643.4937	-607.1244	-1.9146	-2.0973
0.2365	0.89	3700	0.1178	-0.3273	-0.3787	0.4850	0.0514	-629.6786	-593.2114	-1.9279	-2.1117
0.1894	0.91	3800	0.1152	-0.3184	-0.3694	0.4925	0.0509	-620.3304	-584.3237	-1.9252	-2.1088
0.2372	0.94	3900	0.1130	-0.3155	-0.3658	0.4940	0.0503	-616.7926	-581.3542	-1.9194	-2.1021
0.2029	0.96	4000	0.1133	-0.3208	-0.3715	0.4925	0.0507	-622.4911	-586.6887	-1.9141	-2.0964
0.2438	0.98	4100	0.1129	-0.3199	-0.3707	0.4940	0.0508	-621.6636	-585.7551	-1.9140	-2.0965
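
The table is tab-separated, so it can be parsed with the standard library to compare checkpoints. The sketch below uses four rows copied from the table, reduced to three columns, and picks the step with the lowest validation loss. The column names `step`, `validation_loss`, and `accuracy` and the helper `load_rows` are hypothetical names introduced for this example.

```python
import csv
import io

# Four rows copied from the table above (step, Validation Loss,
# Rewards/accuracies), kept tab-separated like the original.
ROWS = """step\tvalidation_loss\taccuracy
100\t0.0447\t0.5883
1100\t0.1249\t0.4581
2800\t0.1073\t0.5105
4100\t0.1129\t0.4940
"""


def load_rows(text):
    """Parse tab-separated rows into dicts with numeric values."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "step": int(row["step"]),
            "validation_loss": float(row["validation_loss"]),
            "accuracy": float(row["accuracy"]),
        }
        for row in reader
    ]


rows = load_rows(ROWS)

# Select the evaluation step with the lowest validation loss.
best = min(rows, key=lambda r: r["validation_loss"])
print(f"lowest validation loss {best['validation_loss']} at step {best['step']}")
```

In the full table, the lowest validation loss also occurs at the first evaluation (step 100). Later evaluations stay between roughly 0.08 and 0.13, which may be worth checking before treating the final checkpoint as the best one.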

Framework versions

PEFT 0.7.1
Transformers 4.36.2
Pytorch 2.1.2+cu121
Datasets 2.14.6
Tokenizers 0.15.2
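
To reproduce the environment, the installed packages can be compared against the versions listed above. This is a minimal sketch using the standard-library `importlib.metadata`. The distribution names in `EXPECTED` (for example `torch` for Pytorch) are assumptions about how the packages are published, and `check_versions` is a hypothetical helper.

```python
from importlib import metadata

# Versions copied from the list above. Keys are assumed distribution names.
EXPECTED = {
    "peft": "0.7.1",
    "transformers": "4.36.2",
    "torch": "2.1.2+cu121",
    "datasets": "2.14.6",
    "tokenizers": "0.15.2",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in check_versions().items():
    print(f"{name}: card lists {wanted}, found {installed or 'not installed'}")
```

A mismatch does not necessarily mean the model will fail to load; the listed versions are only what was used during training.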
