ds_chat_sppo_hard_iter0_2024-09-14-21.15

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat on the self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, the self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized and the self-generate/ds_chat_original_cn_rl_oj_iter0-binarized datasets. It achieves the following results on the evaluation set:

Loss: 4952.6191
Rewards/chosen: 0.0173
Rewards/rejected: 0.0003
Rewards/accuracies: 0.2763
Rewards/margins: 0.0170
Logps/rejected: -63.8573
Logps/chosen: -121.4135
Logits/rejected: 1.7167
Logits/chosen: 1.6591
Debug/policy Chosen Logits: 1.6591
Debug/policy Rejected Logits: 1.7167
Debug/policy Chosen Logps: -121.4135
Debug/policy Rejected Logps: -63.8573
Debug/reference Chosen Logps: -123.1481
Debug/reference Rejected Logps: -63.8871
Debug/sppo Chosen Reward In Loss: 1.7345
Debug/sppo Rej Reward In Loss: 0.0297
Debug/sppo Chosen Loss: 2393.8552
Debug/sppo Reject Loss: 2503.0667
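
The debug metrics above are consistent with one another, which a few lines of arithmetic can confirm. The sketch below uses only the figures reported in the list and recomputes the reward margin and the gap between policy and reference log-probabilities. The scaling factor `beta = 0.01` is an assumption: the card does not state it, and it is inferred here from the ratio between Rewards/chosen and Debug/sppo Chosen Reward In Loss.

```python
# Final evaluation metrics copied from the list above.
policy_chosen_logps = -121.4135
policy_rejected_logps = -63.8573
reference_chosen_logps = -123.1481
reference_rejected_logps = -63.8871
rewards_chosen = 0.0173
rewards_rejected = 0.0003

# Assumption: not stated in the card, inferred from 0.0173 / 1.7345.
beta = 0.01

# Log-probability gap between the trained policy and the reference model.
chosen_logratio = policy_chosen_logps - reference_chosen_logps
rejected_logratio = policy_rejected_logps - reference_rejected_logps

# The margin is the difference between the chosen and rejected rewards.
margin = rewards_chosen - rewards_rejected

print(f"chosen log-ratio:   {chosen_logratio:.4f}  (card reports 1.7345)")
print(f"rejected log-ratio: {rejected_logratio:.4f}  (card reports 0.0297)")
print(f"scaled chosen:      {beta * chosen_logratio:.4f}  (card reports 0.0173)")
print(f"scaled rejected:    {beta * rejected_logratio:.4f}  (card reports 0.0003)")
print(f"margin:             {margin:.4f}  (card reports 0.0170)")
```

Differences in the last digit come from rounding in the reported values. The policy has moved only slightly from the reference: the chosen responses gained about 1.73 nats of log-probability, while the rejected responses are almost unchanged.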

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-07
train_batch_size: 8
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 8
total_train_batch_size: 64
total_eval_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
lr_scheduler_warmup_steps: 100
num_epochs: 8.0
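
The relationship between these settings can be sketched in plain Python. The block below derives the total batch sizes from the per-device values and illustrates a linear schedule with warmup of the kind named by `lr_scheduler_type: linear`. The total step count is an assumption inferred from the training results (step 100 is logged at epoch 0.3623, which suggests about 276 steps per epoch), and `linear_lr` is a hypothetical helper written for illustration, not a function from the training code. Both a warmup ratio and a warmup step count are listed; in the Transformers `Trainer`, a positive `warmup_steps` takes precedence, so the sketch uses 100 steps.

```python
# Values copied from the hyperparameter list above.
learning_rate = 1e-07
train_batch_size = 8   # per device
eval_batch_size = 4    # per device
num_devices = 8
num_epochs = 8
warmup_steps = 100

# Assumption: about 276 optimizer steps per epoch, inferred from the results table.
steps_per_epoch = 276
total_steps = steps_per_epoch * num_epochs

# Effective batch sizes across all devices.
total_train_batch_size = train_batch_size * num_devices
total_eval_batch_size = eval_batch_size * num_devices


def linear_lr(step: int) -> float:
    """Linear warmup to the peak rate, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = max(0, total_steps - step)
    return learning_rate * remaining / (total_steps - warmup_steps)


print(total_train_batch_size, total_eval_batch_size)
for step in (0, 50, 100, 1100, total_steps):
    print(step, linear_lr(step))
```

Under these assumptions the rate peaks at 1e-07 after 100 steps and decays linearly to zero by the end of the eighth epoch.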

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen	Debug/policy Chosen Logits	Debug/policy Rejected Logits	Debug/policy Chosen Logps	Debug/policy Rejected Logps	Debug/reference Chosen Logps	Debug/reference Rejected Logps	Debug/sppo Chosen Reward In Loss	Debug/sppo Rej Reward In Loss	Debug/sppo Chosen Loss	Debug/sppo Reject Loss
4997.2328	0.3623	100	4979.2803	0.0044	-0.0011	0.3289	0.0055	-64.0002	-122.7075	1.7218	1.6606	1.6606	1.7218	-122.7075	-64.0002	-123.1481	-63.8871	0.4405	-0.1132	2463.0359	2489.7429
5010.2789	0.7246	200	4991.7910	0.0172	0.0053	0.3289	0.0119	-63.3570	-121.4287	1.7384	1.6785	1.6785	1.7384	-121.4287	-63.3570	-123.1481	-63.8871	1.7193	0.5301	2393.0474	2574.4319
4985.9242	1.0870	300	4983.2910	0.0172	0.0045	0.3026	0.0128	-63.4403	-121.4232	1.7425	1.6831	1.6831	1.7425	-121.4232	-63.4403	-123.1481	-63.8871	1.7249	0.4468	2390.1040	2560.6348
5008.7777	1.4493	400	4973.2788	0.0150	0.0051	0.3289	0.0099	-63.3768	-121.6436	1.7315	1.6724	1.6724	1.7315	-121.6436	-63.3768	-123.1481	-63.8871	1.5044	0.5103	2394.0208	2569.0662
5014.366	1.8116	500	4963.4956	0.0126	0.0012	0.2895	0.0114	-63.7654	-121.8871	1.7289	1.6691	1.6691	1.7289	-121.8871	-63.7654	-123.1481	-63.8871	1.2610	0.1216	2407.8684	2513.1951
4949.5211	2.1739	600	4968.5161	0.0164	0.0024	0.2895	0.0140	-63.6428	-121.5044	1.7287	1.6694	1.6694	1.7287	-121.5044	-63.6428	-123.1481	-63.8871	1.6436	0.2443	2388.6809	2529.8535
4995.5281	2.5362	700	4965.4644	0.0172	0.0029	0.3684	0.0143	-63.5985	-121.4247	1.7321	1.6727	1.6727	1.7321	-121.4247	-63.5985	-123.1481	-63.8871	1.7233	0.2886	2388.6721	2533.9565
4969.6547	2.8986	800	4971.4702	0.0216	0.0059	0.3684	0.0157	-63.2935	-120.9840	1.7477	1.6868	1.6868	1.7477	-120.9840	-63.2935	-123.1481	-63.8871	2.1640	0.5935	2372.7554	2588.5020
4953.4711	3.2609	900	4955.8784	0.0187	0.0036	0.3026	0.0152	-63.5316	-121.2758	1.7427	1.6827	1.6827	1.7427	-121.2758	-63.5316	-123.1481	-63.8871	1.8722	0.3555	2372.4011	2545.3831
4961.9289	3.6232	1000	4967.9907	0.0209	0.0059	0.3026	0.0150	-63.3005	-121.0624	1.7481	1.6892	1.6892	1.7481	-121.0624	-63.3005	-123.1481	-63.8871	2.0856	0.5865	2372.6270	2586.2114
4979.5078	3.9855	1100	4955.5312	0.0142	0.0005	0.3158	0.0138	-63.8419	-121.7271	1.7192	1.6605	1.6605	1.7192	-121.7271	-63.8419	-123.1481	-63.8871	1.4210	0.0452	2399.0156	2504.7903
4991.6695	4.3478	1200	4958.5435	0.0144	0.0012	0.3026	0.0133	-63.7715	-121.7064	1.7235	1.6634	1.6634	1.7235	-121.7064	-63.7715	-123.1481	-63.8871	1.4416	0.1155	2397.6743	2512.6323
4979.216	4.7101	1300	4964.5874	0.0206	0.0044	0.2763	0.0162	-63.4478	-121.0882	1.7125	1.6538	1.6538	1.7125	-121.0882	-63.4478	-123.1481	-63.8871	2.0598	0.4393	2382.0825	2559.9192
4971.2352	5.0725	1400	4960.7969	0.0177	-0.0003	0.3158	0.0180	-63.9134	-121.3772	1.7162	1.6581	1.6581	1.7162	-121.3772	-63.9134	-123.1481	-63.8871	1.7709	-0.0264	2388.1853	2497.3933
4934.9098	5.4348	1500	4958.9351	0.0189	0.0008	0.3026	0.0181	-63.8062	-121.2587	1.7177	1.6574	1.6574	1.7177	-121.2587	-63.8062	-123.1481	-63.8871	1.8893	0.0808	2388.3345	2508.9651
4983.5867	5.7971	1600	4956.6689	0.0176	0.0012	0.2763	0.0164	-63.7669	-121.3872	1.7142	1.6548	1.6548	1.7142	-121.3872	-63.7669	-123.1481	-63.8871	1.7609	0.1201	2383.7910	2513.2532
4934.0355	6.1594	1700	4958.1274	0.0174	-0.0002	0.25	0.0175	-63.9030	-121.4107	1.7053	1.6455	1.6455	1.7053	-121.4107	-63.9030	-123.1481	-63.8871	1.7373	-0.0159	2402.3301	2498.4744
4962.0086	6.5217	1800	4966.0581	0.0219	0.0012	0.3289	0.0207	-63.7644	-120.9535	1.7137	1.6560	1.6560	1.7137	-120.9535	-63.7644	-123.1481	-63.8871	2.1945	0.1226	2383.6753	2514.6057
4963.9734	6.8841	1900	4958.1865	0.0215	0.0013	0.3026	0.0202	-63.7605	-120.9998	1.7137	1.6550	1.6550	1.7137	-120.9998	-63.7605	-123.1481	-63.8871	2.1483	0.1265	2384.9424	2514.3125
4951.3387	7.2464	2000	4958.5044	0.0208	0.0020	0.3158	0.0189	-63.6920	-121.0652	1.7131	1.6545	1.6545	1.7131	-121.0652	-63.6920	-123.1481	-63.8871	2.0829	0.1950	2385.3457	2523.0876
4969.7758	7.6087	2100	4950.9175	0.0165	-0.0004	0.3421	0.0169	-63.9299	-121.4973	1.7156	1.6569	1.6569	1.7156	-121.4973	-63.9299	-123.1481	-63.8871	1.6508	-0.0429	2386.9766	2495.8533
4946.4094	7.9710	2200	4952.6191	0.0173	0.0003	0.2763	0.0170	-63.8573	-121.4135	1.7167	1.6591	1.6591	1.7167	-121.4135	-63.8573	-123.1481	-63.8871	1.7345	0.0297	2393.8552	2503.0667
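
The final checkpoint is not necessarily the best one by every metric in the table. As a minimal sketch, the snippet below copies three columns (Step, Validation Loss, Rewards/margins) from the rows above and looks up the step with the lowest validation loss and the step with the largest reward margin. Which criterion should drive checkpoint selection is not specified in this card.

```python
# (step, validation loss, rewards/margins) copied from the table above.
rows = [
    (100, 4979.2803, 0.0055), (200, 4991.7910, 0.0119),
    (300, 4983.2910, 0.0128), (400, 4973.2788, 0.0099),
    (500, 4963.4956, 0.0114), (600, 4968.5161, 0.0140),
    (700, 4965.4644, 0.0143), (800, 4971.4702, 0.0157),
    (900, 4955.8784, 0.0152), (1000, 4967.9907, 0.0150),
    (1100, 4955.5312, 0.0138), (1200, 4958.5435, 0.0133),
    (1300, 4964.5874, 0.0162), (1400, 4960.7969, 0.0180),
    (1500, 4958.9351, 0.0181), (1600, 4956.6689, 0.0164),
    (1700, 4958.1274, 0.0175), (1800, 4966.0581, 0.0207),
    (1900, 4958.1865, 0.0202), (2000, 4958.5044, 0.0189),
    (2100, 4950.9175, 0.0169), (2200, 4952.6191, 0.0170),
]

# Row with the lowest validation loss.
best_loss = min(rows, key=lambda r: r[1])
# Row with the largest chosen-vs-rejected reward margin.
best_margin = max(rows, key=lambda r: r[2])

print("lowest validation loss:", best_loss)
print("largest reward margin: ", best_margin)
```

On these figures the lowest validation loss occurs at step 2100 and the largest margin at step 1800, while the headline results at the top of the card correspond to the last logged step, 2200.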

Framework versions

Transformers 4.42.0
PyTorch 2.3.0+cu121
Datasets 2.14.6
Tokenizers 0.19.1

