model_hh_usp3_200

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. Its results on the evaluation set are reported per logged step in the table at the end of this card.

Model description

More information needed

The hyperparameters used during training are not listed. Evaluation metrics were logged every 100 steps:

Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
8.0	100	1.8186	-5.9435	-8.8369	0.6300	2.8934	-120.4156	-116.9887	-0.0816	-0.0143
16.0	200	1.8025	-6.0024	-8.9192	0.6200	2.9168	-120.5071	-117.0542	-0.0812	-0.0142
24.0	300	1.7997	-6.0563	-8.9882	0.6300	2.9319	-120.5837	-117.1141	-0.0806	-0.0133
32.0	400	1.8278	-6.0899	-8.9796	0.6300	2.8898	-120.5742	-117.1513	-0.0794	-0.0122
40.0	500	1.8304	-6.1315	-9.0197	0.6300	2.8882	-120.6187	-117.1977	-0.0796	-0.0128
48.0	600	1.8260	-6.0887	-8.9905	0.6300	2.9018	-120.5863	-117.1501	-0.0804	-0.0131
56.0	700	1.8303	-6.1106	-8.9877	0.6300	2.8771	-120.5832	-117.1744	-0.0801	-0.0132
64.0	800	1.8139	-6.0927	-8.9885	0.6300	2.8957	-120.5840	-117.1545	-0.0786	-0.0117
72.0	900	1.8263	-6.1113	-8.9822	0.6200	2.8709	-120.5770	-117.1752	-0.0789	-0.0120
80.0	1000	1.8202	-6.0974	-8.9863	0.6300	2.8889	-120.5816	-117.1597	-0.0796	-0.0123
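
The reward columns above can be sanity-checked with a short script. The sketch below assumes that Rewards/margins is the mean of (chosen reward minus rejected reward), which is how preference-optimization trainers such as TRL's DPOTrainer usually define it; this card does not confirm which trainer produced the table. The function names `margin_gaps` and `best_step` are hypothetical helpers introduced only for illustration, and the rows are copied from the table.

```python
# Rows copied from the table above:
# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (100, 1.8186, -5.9435, -8.8369, 2.8934),
    (200, 1.8025, -6.0024, -8.9192, 2.9168),
    (300, 1.7997, -6.0563, -8.9882, 2.9319),
    (400, 1.8278, -6.0899, -8.9796, 2.8898),
    (500, 1.8304, -6.1315, -9.0197, 2.8882),
    (600, 1.8260, -6.0887, -8.9905, 2.9018),
    (700, 1.8303, -6.1106, -8.9877, 2.8771),
    (800, 1.8139, -6.0927, -8.9885, 2.8957),
    (900, 1.8263, -6.1113, -8.9822, 2.8709),
    (1000, 1.8202, -6.0974, -8.9863, 2.8889),
]


def margin_gaps(rows):
    """Absolute difference between (chosen - rejected) and the reported margin, per row."""
    # Assumption: margin = chosen - rejected, up to rounding in the logged values.
    return [
        abs((chosen - rejected) - margin)
        for _step, _loss, chosen, rejected, margin in rows
    ]


def best_step(rows):
    """Step whose validation loss is lowest."""
    return min(rows, key=lambda row: row[1])[0]


if __name__ == "__main__":
    print("largest margin gap:", max(margin_gaps(ROWS)))
    print("step with lowest validation loss:", best_step(ROWS))
```

On these rows the reported margin agrees with chosen minus rejected to within rounding (about 1e-4), and the lowest validation loss is at step 300 (1.7997). Validation loss stays between roughly 1.80 and 1.83 across all logged steps.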