ds_chat_sppo_hard_new_iter0_2024-09-15-01.40

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat, trained on three datasets: self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized, and self-generate/ds_chat_original_cn_rl_oj_iter0-binarized. It achieves the following results on the evaluation set (these match the final logged checkpoint, step 2200):

Loss: 0.4619
Rewards/chosen: 0.0067
Rewards/rejected: -0.0352
Rewards/accuracies: 0.5921
Rewards/margins: 0.0419
Logps/rejected: -263.1805
Logps/chosen: -252.2534
Logits/rejected: 1.4436
Logits/chosen: 1.3993
Debug/policy Chosen Logits: 1.3993
Debug/policy Rejected Logits: 1.4436
Debug/policy Chosen Logps: -252.2534
Debug/policy Rejected Logps: -263.1805
Debug/reference Chosen Logps: -252.9185
Debug/reference Rejected Logps: -259.6586
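
The reward metrics above can be cross-checked against the log-probabilities. The following is a minimal sketch, assuming the DPO-style convention that each reward equals a scale factor `beta` times the policy-minus-reference log-probability. This card does not state that convention, and the value `beta = 0.01` is inferred from the reported numbers rather than taken from the training configuration.

```python
# Reported evaluation metrics, copied from the list above.
policy_chosen_logps = -252.2534
policy_rejected_logps = -263.1805
reference_chosen_logps = -252.9185
reference_rejected_logps = -259.6586
reported_chosen_reward = 0.0067
reported_rejected_reward = -0.0352
reported_margin = 0.0419

# ASSUMPTION: scale factor inferred from the numbers, not from the config.
beta = 0.01


def implied_reward(policy_logps, reference_logps, beta):
    """Reward under the assumed convention: beta * (policy - reference)."""
    return beta * (policy_logps - reference_logps)


chosen = implied_reward(policy_chosen_logps, reference_chosen_logps, beta)
rejected = implied_reward(policy_rejected_logps, reference_rejected_logps, beta)
margin = chosen - rejected  # Rewards/margins is chosen minus rejected

print(f"chosen   {chosen:.4f} (reported {reported_chosen_reward})")
print(f"rejected {rejected:.4f} (reported {reported_rejected_reward})")
print(f"margin   {margin:.4f} (reported {reported_margin})")
```

Under these assumptions the three implied values agree with the reported ones to four decimal places, which also confirms that Rewards/margins is simply Rewards/chosen minus Rewards/rejected.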

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-07
train_batch_size: 8
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 8
total_train_batch_size: 64
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
lr_scheduler_warmup_steps: 100
num_epochs: 8.0
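
The batch-size and scheduler settings above can be sketched in a few lines. This is an illustrative reconstruction, not the training script: it assumes no gradient accumulation (none is listed), assumes the explicit `warmup_steps` value takes precedence over `warmup_ratio`, and infers the total number of optimizer steps from the results table (step 100 is logged at epoch 0.3623, which gives about 276 steps per epoch).

```python
# Values from the hyperparameter list above.
learning_rate = 1e-07
train_batch_size = 8  # per device
num_devices = 8
num_epochs = 8
warmup_steps = 100

# ASSUMPTION: no gradient accumulation; 8 * 8 already equals the
# reported total_train_batch_size of 64.
gradient_accumulation_steps = 1
total_train_batch_size = (
    train_batch_size * num_devices * gradient_accumulation_steps
)

# ASSUMPTION: steps per epoch inferred from the results table
# (step 100 corresponds to epoch 0.3623).
steps_per_epoch = round(100 / 0.3623)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step):
    """Linear warmup to the peak rate, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / (total_steps - warmup_steps)


print(total_train_batch_size, steps_per_epoch, total_steps)
for step in (0, 50, 100, 1154, total_steps):
    print(step, linear_lr(step))
```

Note that a warmup ratio of 0.1 applied to the inferred total would give roughly 221 warmup steps, not 100, so the two listed warmup settings cannot both be in effect; the sketch uses the explicit step count.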

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen	Debug/policy Chosen Logits	Debug/policy Rejected Logits	Debug/policy Chosen Logps	Debug/policy Rejected Logps	Debug/reference Chosen Logps	Debug/reference Rejected Logps
0.4973	0.3623	100	0.4977	-0.0056	-0.0071	0.5132	0.0014	-260.3654	-253.4812	1.6987	1.6372	1.6372	1.6987	-253.4812	-260.3654	-252.9185	-259.6586
0.4917	0.7246	200	0.4919	-0.0069	-0.0126	0.5395	0.0058	-260.9230	-253.6065	1.6704	1.6087	1.6087	1.6704	-253.6065	-260.9230	-252.9185	-259.6586
0.4837	1.0870	300	0.4862	-0.0085	-0.0167	0.5789	0.0082	-261.3287	-253.7711	1.6490	1.5905	1.5905	1.6490	-253.7711	-261.3287	-252.9185	-259.6586
0.4821	1.4493	400	0.4822	-0.0046	-0.0173	0.5132	0.0127	-261.3844	-253.3754	1.6131	1.5560	1.5560	1.6131	-253.3754	-261.3844	-252.9185	-259.6586
0.4724	1.8116	500	0.4773	-0.0010	-0.0181	0.4737	0.0171	-261.4722	-253.0200	1.5870	1.5328	1.5328	1.5870	-253.0200	-261.4722	-252.9185	-259.6586
0.4677	2.1739	600	0.4750	-0.0007	-0.0218	0.5132	0.0212	-261.8435	-252.9872	1.5701	1.5167	1.5167	1.5701	-252.9872	-261.8435	-252.9185	-259.6586
0.4625	2.5362	700	0.5077	0.0917	0.0741	0.6447	0.0176	-252.2495	-243.7507	1.5700	1.5133	1.5133	1.5700	-243.7507	-252.2495	-252.9185	-259.6586
0.4650	2.8986	800	0.4709	-0.0024	-0.0313	0.5658	0.0289	-262.7887	-253.1583	1.5298	1.4781	1.4781	1.5298	-253.1583	-262.7887	-252.9185	-259.6586
0.4551	3.2609	900	0.4689	-0.0039	-0.0344	0.5658	0.0304	-263.0977	-253.3132	1.5177	1.4670	1.4670	1.5177	-253.3132	-263.0977	-252.9185	-259.6586
0.4614	3.6232	1000	0.4687	-0.0108	-0.0450	0.5789	0.0342	-264.1606	-253.9997	1.5075	1.4592	1.4592	1.5075	-253.9997	-264.1606	-252.9185	-259.6586
0.4579	3.9855	1100	0.4668	0.0012	-0.0346	0.5789	0.0358	-263.1156	-252.7994	1.5016	1.4527	1.4527	1.5016	-252.7994	-263.1156	-252.9185	-259.6586
0.4466	4.3478	1200	0.4663	0.0006	-0.0344	0.5526	0.0349	-263.0953	-252.8606	1.4940	1.4448	1.4448	1.4940	-252.8606	-263.0953	-252.9185	-259.6586
0.4696	4.7101	1300	0.4644	0.0027	-0.0346	0.5921	0.0373	-263.1194	-252.6523	1.4687	1.4226	1.4226	1.4687	-252.6523	-263.1194	-252.9185	-259.6586
0.4571	5.0725	1400	0.4643	-0.0002	-0.0394	0.5789	0.0392	-263.5992	-252.9413	1.4644	1.4177	1.4177	1.4644	-252.9413	-263.5992	-252.9185	-259.6586
0.4500	5.4348	1500	0.4637	0.0047	-0.0343	0.5789	0.0390	-263.0912	-252.4461	1.4551	1.4102	1.4102	1.4551	-252.4461	-263.0912	-252.9185	-259.6586
0.4561	5.7971	1600	0.4627	0.0063	-0.0340	0.5921	0.0403	-263.0588	-252.2838	1.4579	1.4127	1.4127	1.4579	-252.2838	-263.0588	-252.9185	-259.6586
0.4505	6.1594	1700	0.4616	0.0094	-0.0319	0.6316	0.0413	-262.8479	-251.9740	1.4445	1.4000	1.4000	1.4445	-251.9740	-262.8479	-252.9185	-259.6586
0.4563	6.5217	1800	0.4613	0.0084	-0.0356	0.6053	0.0440	-263.2198	-252.0771	1.4420	1.3981	1.3981	1.4420	-252.0771	-263.2198	-252.9185	-259.6586
0.4675	6.8841	1900	0.4616	0.0069	-0.0366	0.6053	0.0435	-263.3192	-252.2319	1.4424	1.3959	1.3959	1.4424	-252.2319	-263.3192	-252.9185	-259.6586
0.4502	7.2464	2000	0.4619	0.0071	-0.0364	0.5789	0.0435	-263.2976	-252.2066	1.4432	1.3985	1.3985	1.4432	-252.2066	-263.2976	-252.9185	-259.6586
0.4473	7.6087	2100	0.4623	0.0028	-0.0403	0.5921	0.0431	-263.6902	-252.6375	1.4423	1.3964	1.3964	1.4423	-252.6375	-263.6902	-252.9185	-259.6586
0.4508	7.9710	2200	0.4619	0.0067	-0.0352	0.5921	0.0419	-263.1805	-252.2534	1.4436	1.3993	1.3993	1.4436	-252.2534	-263.1805	-252.9185	-259.6586
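
To pick out the evaluation step with the lowest validation loss, the Step and Validation Loss columns of the table above can be scanned directly. A small sketch, with the values copied by hand from the table:

```python
# (step, validation loss) pairs copied from the table above.
validation_loss = [
    (100, 0.4977), (200, 0.4919), (300, 0.4862), (400, 0.4822),
    (500, 0.4773), (600, 0.4750), (700, 0.5077), (800, 0.4709),
    (900, 0.4689), (1000, 0.4687), (1100, 0.4668), (1200, 0.4663),
    (1300, 0.4644), (1400, 0.4643), (1500, 0.4637), (1600, 0.4627),
    (1700, 0.4616), (1800, 0.4613), (1900, 0.4616), (2000, 0.4619),
    (2100, 0.4623), (2200, 0.4619),
]

# Row with the smallest validation loss.
best_step, best_loss = min(validation_loss, key=lambda row: row[1])
print(best_step, best_loss)  # prints: 1800 0.4613

# Rows where validation loss rose relative to the previous evaluation.
increases = [
    (step, loss)
    for (_, prev), (step, loss) in zip(validation_loss, validation_loss[1:])
    if loss > prev
]
print(increases)
```

The minimum falls at step 1800 (0.4613), slightly below the final value of 0.4619 at step 2200. Step 700 stands out as an outlier: its validation loss jumps to 0.5077 and its policy log-probabilities are about 9 higher than in the neighbouring rows, while the reference log-probabilities are unchanged.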

Framework versions

Transformers 4.42.0
Pytorch 2.3.0+cu121
Datasets 2.14.6
Tokenizers 0.19.1
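
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch only: the PyPI distribution names (for example `torch` for Pytorch) are assumed rather than stated in this card, and `find_mismatches` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above; the distribution names are assumptions.
PINNED = {
    "transformers": "4.42.0",
    "torch": "2.3.0+cu121",
    "datasets": "2.14.6",
    "tokenizers": "0.19.1",
}


def find_mismatches(pinned, lookup=version):
    """Return {package: (expected, installed or None)} for each mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An exact string match is deliberately strict; a different CUDA build suffix on `torch`, for instance, will be reported as a mismatch even when the base version agrees.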

Model repository: yiran-wang3/ds_chat_sppo_hard_new_iter0_nomask_linear_schedule
