silviasapora committed 1550e8f (verified)
1 parent: e19d20c

Model save

Files changed (4):
  1. README.md +67 -0
  2. all_results.json +9 -0
  3. train_results.json +9 -0
  4. trainer_state.json +1176 -0
  1. README.md +67 -0
  2. all_results.json +9 -0
  3. train_results.json +9 -0
  4. trainer_state.json +1176 -0
README.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ base_model: mistralai/Mistral-7B-v0.3
+ library_name: transformers
+ model_name: mistral-7b-orpo-noisy-6e-5
+ tags:
+ - generated_from_trainer
+ - trl
+ - orpo
+ licence: license
+ ---
+
+ # Model Card for mistral-7b-orpo-noisy-6e-5
+
+ This model is a fine-tuned version of [mistralai/Mistral-7B-v0.3](https://huggingface.co/mistralai/Mistral-7B-v0.3).
+ It has been trained using [TRL](https://github.com/huggingface/trl).
+
+ ## Quick start
+
+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="silviasapora/mistral-7b-orpo-noisy-6e-5", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
+
+ ## Training procedure
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/silvias/huggingface/runs/5sqhouro)
+
+
+ This model was trained with ORPO, a method introduced in [ORPO: Monolithic Preference Optimization without Reference Model](https://huggingface.co/papers/2403.07691).
+
+ ### Framework versions
+
+ - TRL: 0.13.0
+ - Transformers: 4.46.1
+ - Pytorch: 2.4.0
+ - Datasets: 3.1.0
+ - Tokenizers: 0.20.1
+
+ ## Citations
+
+ Cite ORPO as:
+
+ ```bibtex
+ @article{hong2024orpo,
+ title = {{ORPO: Monolithic Preference Optimization without Reference Model}},
+ author = {Jiwoo Hong and Noah Lee and James Thorne},
+ year = 2024,
+ eprint = {arXiv:2403.07691}
+ }
+ ```
+
+ Cite TRL as:
+
+ ```bibtex
+ @misc{vonwerra2022trl,
+ title = {{TRL: Transformer Reinforcement Learning}},
+ author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
+ year = 2020,
+ journal = {GitHub repository},
+ publisher = {GitHub},
+ howpublished = {\url{https://github.com/huggingface/trl}}
+ }
+ ```
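The card names ORPO but does not spell out its objective. To help interpret the ORPO-specific fields logged in trainer_state.json (`log_odds_chosen`, `log_odds_ratio`, `nll_loss`, `rewards/*`), here is a minimal scalar sketch of the odds-ratio term. It follows our reading of the paper and of TRL's `ORPOTrainer`, and it is not code from this repository. The helper names `log_sigmoid` and `orpo_terms` are hypothetical. The weight beta = 0.05 is an assumption inferred from the logs (at step 5, `rewards/chosen` / `logps/chosen` is about 0.05) and is not stated anywhere in the commit.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def orpo_terms(chosen_logp: float, rejected_logp: float, nll: float, beta: float) -> dict:
    """Sketch of ORPO quantities for one preference pair.

    chosen_logp / rejected_logp: average per-token log-probabilities (< 0).
    nll: negative log-likelihood of the chosen response (SFT term).
    beta: weight of the odds-ratio term (assumed hyperparameter).
    """
    # log odds(y) = log p - log(1 - p), with p = exp(logp); take the difference
    log_odds = (chosen_logp - rejected_logp) - (
        math.log1p(-math.exp(chosen_logp)) - math.log1p(-math.exp(rejected_logp))
    )
    ratio = log_sigmoid(log_odds)        # the quantity logged as "log_odds_ratio"
    objective = nll - beta * ratio       # subtracting it means the ratio is maximized
    chosen_reward = beta * chosen_logp   # "rewards" are scaled log-probabilities
    rejected_reward = beta * rejected_logp
    return {
        "log_odds_chosen": log_odds,
        "log_odds_ratio": ratio,
        "objective": objective,
        "rewards/chosen": chosen_reward,
        "rewards/rejected": rejected_reward,
        "rewards/margins": chosen_reward - rejected_reward,
    }


# Batch-mean values copied from the step-5 entry of trainer_state.json
terms = orpo_terms(-1.2910274267196655, -1.6328433752059937, nll=1.511154294013977, beta=0.05)
print(terms["rewards/chosen"], terms["rewards/margins"])
```

The logged metrics are batch means and the odds-ratio term is nonlinear. Feeding in the batch-mean log-probabilities therefore reproduces only the linear `rewards/*` fields (up to float32 rounding), not `log_odds_chosen` or `log_odds_ratio`.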
all_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "epoch": 2.986666666666667,
+ "total_flos": 0.0,
+ "train_loss": 31.135096304757255,
+ "train_runtime": 6745.6063,
+ "train_samples": 6750,
+ "train_samples_per_second": 3.002,
+ "train_steps_per_second": 0.047
+ }
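The throughput figures in all_results.json can be derived from the other fields. The sketch below reproduces them under two assumptions, both inferred from the logs rather than stated in the commit. First, the run was configured for 3 epochs. Second, the effective batch size was 64: 315 steps × 64 / 6750 samples gives exactly the logged epoch of 2.98667, and 6750 // 64 = 105 steps per epoch. The variable names are illustrative.

```python
# Values copied from all_results.json
train_samples = 6750
train_runtime_s = 6745.6063

# Assumptions inferred from the logs (not stated in the commit)
num_train_epochs = 3
effective_batch = 64  # per-device batch x devices x gradient accumulation

steps_per_epoch = train_samples // effective_batch   # 105 full optimizer steps
global_step = steps_per_epoch * num_train_epochs      # 315, as in trainer_state.json
# The leftover 6750 - 105 * 64 = 30 samples per epoch never form a full step,
# which is consistent with the logged epoch stopping just short of 3.
fractional_epoch = global_step * effective_batch / train_samples

samples_per_second = round(train_samples * num_train_epochs / train_runtime_s, 3)
steps_per_second = round(global_step / train_runtime_s, 3)

print(global_step, fractional_epoch, samples_per_second, steps_per_second)
```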
train_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "epoch": 2.986666666666667,
+ "total_flos": 0.0,
+ "train_loss": 31.135096304757255,
+ "train_runtime": 6745.6063,
+ "train_samples": 6750,
+ "train_samples_per_second": 3.002,
+ "train_steps_per_second": 0.047
+ }
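The `learning_rate` values in the `log_history` of trainer_state.json follow a linear warmup and then a cosine decay. The sketch below reproduces a sample of the logged values under three assumptions: a peak of 6e-5 (matching the "6e-5" suffix of the model name), 32 warmup steps, and decay to zero at step 315. The warmup length and the schedule shape were inferred by fitting the logs, not read from a training config.

```python
import math

PEAK_LR = 6e-5       # assumption: matches the "6e-5" suffix of the model name
WARMUP_STEPS = 32    # assumption: inferred from the logged values
TOTAL_STEPS = 315    # "global_step" in trainer_state.json


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero (sketch)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# (step, logged learning_rate) pairs copied from log_history
logged = [
    (5, 9.375000000000001e-06),
    (30, 5.625e-05),
    (35, 5.998336508818541e-05),
    (100, 5.1850812167218644e-05),
    (210, 1.8174260688798445e-05),
]
for step, lr in logged:
    print(step, lr, lr_at(step), math.isclose(lr, lr_at(step), rel_tol=1e-6))
```

A 32-step warmup would also be consistent with a warmup ratio of 0.1 rounded up (0.1 × 315 = 31.5, rounded up to 32). The config that actually produced it is not part of this commit, though.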
trainer_state.json ADDED
@@ -0,0 +1,1176 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 2.986666666666667,
5
+ "eval_steps": 500,
6
+ "global_step": 315,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.047407407407407405,
13
+ "grad_norm": 270.0,
14
+ "learning_rate": 9.375000000000001e-06,
15
+ "log_odds_chosen": 0.41771596670150757,
16
+ "log_odds_ratio": -0.7694265246391296,
17
+ "logits/chosen": -2.967926502227783,
18
+ "logits/rejected": -2.8778510093688965,
19
+ "logps/chosen": -1.2910274267196655,
20
+ "logps/rejected": -1.6328433752059937,
21
+ "loss": 51.9175,
22
+ "nll_loss": 1.511154294013977,
23
+ "rewards/accuracies": 0.574999988079071,
24
+ "rewards/chosen": -0.06455137580633163,
25
+ "rewards/margins": 0.017090797424316406,
26
+ "rewards/rejected": -0.08164217323064804,
27
+ "step": 5
28
+ },
29
+ {
30
+ "epoch": 0.09481481481481481,
31
+ "grad_norm": 61.75,
32
+ "learning_rate": 1.8750000000000002e-05,
33
+ "log_odds_chosen": 0.2715613842010498,
34
+ "log_odds_ratio": -0.7063366174697876,
35
+ "logits/chosen": -2.903649091720581,
36
+ "logits/rejected": -2.737760066986084,
37
+ "logps/chosen": -1.0549781322479248,
38
+ "logps/rejected": -1.2600512504577637,
39
+ "loss": 47.142,
40
+ "nll_loss": 1.388285517692566,
41
+ "rewards/accuracies": 0.59375,
42
+ "rewards/chosen": -0.0527489073574543,
43
+ "rewards/margins": 0.010253657586872578,
44
+ "rewards/rejected": -0.0630025640130043,
45
+ "step": 10
46
+ },
47
+ {
48
+ "epoch": 0.14222222222222222,
49
+ "grad_norm": 71.0,
50
+ "learning_rate": 2.8125e-05,
51
+ "log_odds_chosen": 0.26208820939064026,
52
+ "log_odds_ratio": -0.6782652139663696,
53
+ "logits/chosen": -2.5858330726623535,
54
+ "logits/rejected": -2.4748170375823975,
55
+ "logps/chosen": -0.921225368976593,
56
+ "logps/rejected": -1.081853985786438,
57
+ "loss": 46.5682,
58
+ "nll_loss": 1.4536203145980835,
59
+ "rewards/accuracies": 0.6187499761581421,
60
+ "rewards/chosen": -0.04606126993894577,
61
+ "rewards/margins": 0.008031422272324562,
62
+ "rewards/rejected": -0.05409269407391548,
63
+ "step": 15
64
+ },
65
+ {
66
+ "epoch": 0.18962962962962962,
67
+ "grad_norm": 92.0,
68
+ "learning_rate": 3.7500000000000003e-05,
69
+ "log_odds_chosen": 0.18918053805828094,
70
+ "log_odds_ratio": -0.7011532783508301,
71
+ "logits/chosen": -2.4796886444091797,
72
+ "logits/rejected": -2.3676044940948486,
73
+ "logps/chosen": -0.8980884552001953,
74
+ "logps/rejected": -1.0366885662078857,
75
+ "loss": 43.8254,
76
+ "nll_loss": 1.3132244348526,
77
+ "rewards/accuracies": 0.606249988079071,
78
+ "rewards/chosen": -0.044904422014951706,
79
+ "rewards/margins": 0.00693000853061676,
80
+ "rewards/rejected": -0.05183442682027817,
81
+ "step": 20
82
+ },
83
+ {
84
+ "epoch": 0.23703703703703705,
85
+ "grad_norm": 39.75,
86
+ "learning_rate": 4.6875e-05,
87
+ "log_odds_chosen": 0.20047792792320251,
88
+ "log_odds_ratio": -0.704607367515564,
89
+ "logits/chosen": -2.486077070236206,
90
+ "logits/rejected": -2.380354881286621,
91
+ "logps/chosen": -0.9169360995292664,
92
+ "logps/rejected": -1.0517938137054443,
93
+ "loss": 41.7393,
94
+ "nll_loss": 1.3125925064086914,
95
+ "rewards/accuracies": 0.5625,
96
+ "rewards/chosen": -0.04584680497646332,
97
+ "rewards/margins": 0.006742885801941156,
98
+ "rewards/rejected": -0.052589692175388336,
99
+ "step": 25
100
+ },
101
+ {
102
+ "epoch": 0.28444444444444444,
103
+ "grad_norm": 48.25,
104
+ "learning_rate": 5.625e-05,
105
+ "log_odds_chosen": 0.13125675916671753,
106
+ "log_odds_ratio": -0.714677631855011,
107
+ "logits/chosen": -2.498034954071045,
108
+ "logits/rejected": -2.1313350200653076,
109
+ "logps/chosen": -0.8735687136650085,
110
+ "logps/rejected": -0.9592132568359375,
111
+ "loss": 41.493,
112
+ "nll_loss": 1.2559503316879272,
113
+ "rewards/accuracies": 0.5687500238418579,
114
+ "rewards/chosen": -0.04367843270301819,
115
+ "rewards/margins": 0.004282232839614153,
116
+ "rewards/rejected": -0.047960661351680756,
117
+ "step": 30
118
+ },
119
+ {
120
+ "epoch": 0.33185185185185184,
121
+ "grad_norm": 39.0,
122
+ "learning_rate": 5.998336508818541e-05,
123
+ "log_odds_chosen": 0.06586463749408722,
124
+ "log_odds_ratio": -0.7540255188941956,
125
+ "logits/chosen": -2.3031373023986816,
126
+ "logits/rejected": -2.4485721588134766,
127
+ "logps/chosen": -0.8969869613647461,
128
+ "logps/rejected": -0.9382694363594055,
129
+ "loss": 40.5282,
130
+ "nll_loss": 1.2521088123321533,
131
+ "rewards/accuracies": 0.53125,
132
+ "rewards/chosen": -0.04484934359788895,
133
+ "rewards/margins": 0.0020641214214265347,
134
+ "rewards/rejected": -0.04691346734762192,
135
+ "step": 35
136
+ },
137
+ {
138
+ "epoch": 0.37925925925925924,
139
+ "grad_norm": 35.25,
140
+ "learning_rate": 5.988177409372154e-05,
141
+ "log_odds_chosen": 0.21956849098205566,
142
+ "log_odds_ratio": -0.6834356188774109,
143
+ "logits/chosen": -2.2723240852355957,
144
+ "logits/rejected": -2.065520763397217,
145
+ "logps/chosen": -0.8550432324409485,
146
+ "logps/rejected": -0.9936239123344421,
147
+ "loss": 40.1999,
148
+ "nll_loss": 1.2111032009124756,
149
+ "rewards/accuracies": 0.550000011920929,
150
+ "rewards/chosen": -0.04275216534733772,
151
+ "rewards/margins": 0.006929035298526287,
152
+ "rewards/rejected": -0.049681201577186584,
153
+ "step": 40
154
+ },
155
+ {
156
+ "epoch": 0.4266666666666667,
157
+ "grad_norm": 33.0,
158
+ "learning_rate": 5.968814624645376e-05,
159
+ "log_odds_chosen": 0.1635906845331192,
160
+ "log_odds_ratio": -0.7335126996040344,
161
+ "logits/chosen": -1.9793834686279297,
162
+ "logits/rejected": -2.0152127742767334,
163
+ "logps/chosen": -0.8692102432250977,
164
+ "logps/rejected": -0.975513756275177,
165
+ "loss": 40.1299,
166
+ "nll_loss": 1.2306907176971436,
167
+ "rewards/accuracies": 0.543749988079071,
168
+ "rewards/chosen": -0.04346051067113876,
169
+ "rewards/margins": 0.005315178073942661,
170
+ "rewards/rejected": -0.04877568781375885,
171
+ "step": 45
172
+ },
173
+ {
174
+ "epoch": 0.4740740740740741,
175
+ "grad_norm": 39.75,
176
+ "learning_rate": 5.9403077926557534e-05,
177
+ "log_odds_chosen": 0.16285523772239685,
178
+ "log_odds_ratio": -0.7225431203842163,
179
+ "logits/chosen": -1.9700477123260498,
180
+ "logits/rejected": -1.93939208984375,
181
+ "logps/chosen": -0.9150403738021851,
182
+ "logps/rejected": -1.005976676940918,
183
+ "loss": 42.3638,
184
+ "nll_loss": 1.3180006742477417,
185
+ "rewards/accuracies": 0.5874999761581421,
186
+ "rewards/chosen": -0.04575202241539955,
187
+ "rewards/margins": 0.004546813666820526,
188
+ "rewards/rejected": -0.05029883235692978,
189
+ "step": 50
190
+ },
191
+ {
192
+ "epoch": 0.5214814814814814,
193
+ "grad_norm": 32.5,
194
+ "learning_rate": 5.9027447153889215e-05,
195
+ "log_odds_chosen": 0.074959896504879,
196
+ "log_odds_ratio": -0.7347756624221802,
197
+ "logits/chosen": -1.8091471195220947,
198
+ "logits/rejected": -1.627111792564392,
199
+ "logps/chosen": -0.8783036470413208,
200
+ "logps/rejected": -0.9295312166213989,
201
+ "loss": 39.4889,
202
+ "nll_loss": 1.2200506925582886,
203
+ "rewards/accuracies": 0.543749988079071,
204
+ "rewards/chosen": -0.04391518235206604,
205
+ "rewards/margins": 0.002561377827078104,
206
+ "rewards/rejected": -0.046476561576128006,
207
+ "step": 55
208
+ },
209
+ {
210
+ "epoch": 0.5688888888888889,
211
+ "grad_norm": 31.875,
212
+ "learning_rate": 5.856241088365584e-05,
213
+ "log_odds_chosen": 0.21836993098258972,
214
+ "log_odds_ratio": -0.6648741960525513,
215
+ "logits/chosen": -2.2059853076934814,
216
+ "logits/rejected": -1.8393570184707642,
217
+ "logps/chosen": -0.8266533613204956,
218
+ "logps/rejected": -0.944961428642273,
219
+ "loss": 38.4828,
220
+ "nll_loss": 1.1561448574066162,
221
+ "rewards/accuracies": 0.5625,
222
+ "rewards/chosen": -0.0413326658308506,
223
+ "rewards/margins": 0.005915405694395304,
224
+ "rewards/rejected": -0.04724807292222977,
225
+ "step": 60
226
+ },
227
+ {
228
+ "epoch": 0.6162962962962963,
229
+ "grad_norm": 31.75,
230
+ "learning_rate": 5.800940144295476e-05,
231
+ "log_odds_chosen": 0.14650335907936096,
232
+ "log_odds_ratio": -0.7161463499069214,
233
+ "logits/chosen": -1.9598195552825928,
234
+ "logits/rejected": -1.8855581283569336,
235
+ "logps/chosen": -0.890237033367157,
236
+ "logps/rejected": -0.9888992309570312,
237
+ "loss": 38.4474,
238
+ "nll_loss": 1.1701605319976807,
239
+ "rewards/accuracies": 0.543749988079071,
240
+ "rewards/chosen": -0.04451185464859009,
241
+ "rewards/margins": 0.004933114163577557,
242
+ "rewards/rejected": -0.04944496601819992,
243
+ "step": 65
244
+ },
245
+ {
246
+ "epoch": 0.6637037037037037,
247
+ "grad_norm": 31.25,
248
+ "learning_rate": 5.7370122119158855e-05,
249
+ "log_odds_chosen": 0.2176527976989746,
250
+ "log_odds_ratio": -0.6846314072608948,
251
+ "logits/chosen": -2.401538610458374,
252
+ "logits/rejected": -1.8277839422225952,
253
+ "logps/chosen": -0.8448864817619324,
254
+ "logps/rejected": -1.0054863691329956,
255
+ "loss": 38.0337,
256
+ "nll_loss": 1.1650002002716064,
257
+ "rewards/accuracies": 0.59375,
258
+ "rewards/chosen": -0.0422443225979805,
259
+ "rewards/margins": 0.00802999921143055,
260
+ "rewards/rejected": -0.0502743236720562,
261
+ "step": 70
262
+ },
263
+ {
264
+ "epoch": 0.7111111111111111,
265
+ "grad_norm": 34.5,
266
+ "learning_rate": 5.6646541913735056e-05,
267
+ "log_odds_chosen": 0.34958410263061523,
268
+ "log_odds_ratio": -0.6144381761550903,
269
+ "logits/chosen": -1.9800045490264893,
270
+ "logits/rejected": -2.1104648113250732,
271
+ "logps/chosen": -0.7909008264541626,
272
+ "logps/rejected": -1.0040924549102783,
273
+ "loss": 38.1687,
274
+ "nll_loss": 1.1795135736465454,
275
+ "rewards/accuracies": 0.675000011920929,
276
+ "rewards/chosen": -0.03954503685235977,
277
+ "rewards/margins": 0.01065958570688963,
278
+ "rewards/rejected": -0.050204623490571976,
279
+ "step": 75
280
+ },
281
+ {
282
+ "epoch": 0.7585185185185185,
283
+ "grad_norm": 30.75,
284
+ "learning_rate": 5.5840889477654665e-05,
285
+ "log_odds_chosen": 0.25651440024375916,
286
+ "log_odds_ratio": -0.6959986686706543,
287
+ "logits/chosen": -2.3197991847991943,
288
+ "logits/rejected": -1.9464877843856812,
289
+ "logps/chosen": -0.8813208341598511,
290
+ "logps/rejected": -1.0397270917892456,
291
+ "loss": 37.8718,
292
+ "nll_loss": 1.1892088651657104,
293
+ "rewards/accuracies": 0.612500011920929,
294
+ "rewards/chosen": -0.04406604543328285,
295
+ "rewards/margins": 0.007920312695205212,
296
+ "rewards/rejected": -0.05198635905981064,
297
+ "step": 80
298
+ },
299
+ {
300
+ "epoch": 0.8059259259259259,
301
+ "grad_norm": 26.75,
302
+ "learning_rate": 5.495564624707466e-05,
303
+ "log_odds_chosen": 0.24601168930530548,
304
+ "log_odds_ratio": -0.6635450720787048,
305
+ "logits/chosen": -2.3646152019500732,
306
+ "logits/rejected": -1.6053569316864014,
307
+ "logps/chosen": -0.825157642364502,
308
+ "logps/rejected": -0.9831274151802063,
309
+ "loss": 37.7402,
310
+ "nll_loss": 1.1433099508285522,
311
+ "rewards/accuracies": 0.625,
312
+ "rewards/chosen": -0.04125788062810898,
313
+ "rewards/margins": 0.007898489013314247,
314
+ "rewards/rejected": -0.04915637522935867,
315
+ "step": 85
316
+ },
317
+ {
318
+ "epoch": 0.8533333333333334,
319
+ "grad_norm": 32.25,
320
+ "learning_rate": 5.399353880043222e-05,
321
+ "log_odds_chosen": 0.2743232548236847,
322
+ "log_odds_ratio": -0.6547017097473145,
323
+ "logits/chosen": -2.2998695373535156,
324
+ "logits/rejected": -1.9147183895111084,
325
+ "logps/chosen": -0.7987005710601807,
326
+ "logps/rejected": -0.9570469856262207,
327
+ "loss": 39.0598,
328
+ "nll_loss": 1.165475606918335,
329
+ "rewards/accuracies": 0.643750011920929,
330
+ "rewards/chosen": -0.03993503004312515,
331
+ "rewards/margins": 0.00791732408106327,
332
+ "rewards/rejected": -0.047852352261543274,
333
+ "step": 90
334
+ },
335
+ {
336
+ "epoch": 0.9007407407407407,
337
+ "grad_norm": 27.875,
338
+ "learning_rate": 5.295753046049293e-05,
339
+ "log_odds_chosen": 0.3104208707809448,
340
+ "log_odds_ratio": -0.6332544088363647,
341
+ "logits/chosen": -2.3582139015197754,
342
+ "logits/rejected": -1.8759187459945679,
343
+ "logps/chosen": -0.7584289908409119,
344
+ "logps/rejected": -0.9375408887863159,
345
+ "loss": 38.0891,
346
+ "nll_loss": 1.0971782207489014,
347
+ "rewards/accuracies": 0.6625000238418579,
348
+ "rewards/chosen": -0.037921447306871414,
349
+ "rewards/margins": 0.00895559974014759,
350
+ "rewards/rejected": -0.046877048909664154,
351
+ "step": 95
352
+ },
353
+ {
354
+ "epoch": 0.9481481481481482,
355
+ "grad_norm": 28.0,
356
+ "learning_rate": 5.1850812167218644e-05,
357
+ "log_odds_chosen": 0.14483532309532166,
358
+ "log_odds_ratio": -0.725937008857727,
359
+ "logits/chosen": -2.2758800983428955,
360
+ "logits/rejected": -1.807739019393921,
361
+ "logps/chosen": -0.8715543746948242,
362
+ "logps/rejected": -0.967276394367218,
363
+ "loss": 38.3176,
364
+ "nll_loss": 1.192031979560852,
365
+ "rewards/accuracies": 0.550000011920929,
366
+ "rewards/chosen": -0.04357772320508957,
367
+ "rewards/margins": 0.004786101635545492,
368
+ "rewards/rejected": -0.0483638234436512,
369
+ "step": 100
370
+ },
371
+ {
372
+ "epoch": 0.9955555555555555,
373
+ "grad_norm": 27.5,
374
+ "learning_rate": 5.067679264956681e-05,
375
+ "log_odds_chosen": 0.2537747621536255,
376
+ "log_odds_ratio": -0.6505337953567505,
377
+ "logits/chosen": -2.356247901916504,
378
+ "logits/rejected": -1.834238052368164,
379
+ "logps/chosen": -0.8050671815872192,
380
+ "logps/rejected": -0.9783474206924438,
381
+ "loss": 37.2872,
382
+ "nll_loss": 1.1223804950714111,
383
+ "rewards/accuracies": 0.6187499761581421,
384
+ "rewards/chosen": -0.04025335982441902,
385
+ "rewards/margins": 0.008664008229970932,
386
+ "rewards/rejected": -0.04891737177968025,
387
+ "step": 105
388
+ },
389
+ {
390
+ "epoch": 1.0429629629629629,
391
+ "grad_norm": 26.625,
392
+ "learning_rate": 4.943908792649255e-05,
393
+ "log_odds_chosen": 0.7013475298881531,
394
+ "log_odds_ratio": -0.5124364495277405,
395
+ "logits/chosen": -2.1879847049713135,
396
+ "logits/rejected": -1.7397973537445068,
397
+ "logps/chosen": -0.6286161541938782,
398
+ "logps/rejected": -0.9900426864624023,
399
+ "loss": 31.0738,
400
+ "nll_loss": 0.9219255447387695,
401
+ "rewards/accuracies": 0.75,
402
+ "rewards/chosen": -0.03143080696463585,
403
+ "rewards/margins": 0.018071329221129417,
404
+ "rewards/rejected": -0.049502138048410416,
405
+ "step": 110
406
+ },
407
+ {
408
+ "epoch": 1.0903703703703704,
409
+ "grad_norm": 30.25,
410
+ "learning_rate": 4.814151016949061e-05,
411
+ "log_odds_chosen": 0.9226773977279663,
412
+ "log_odds_ratio": -0.42190057039260864,
413
+ "logits/chosen": -2.087578058242798,
414
+ "logits/rejected": -1.6346263885498047,
415
+ "logps/chosen": -0.5833232998847961,
416
+ "logps/rejected": -1.0628697872161865,
417
+ "loss": 30.0304,
418
+ "nll_loss": 0.8877116441726685,
419
+ "rewards/accuracies": 0.800000011920929,
420
+ "rewards/chosen": -0.02916616201400757,
421
+ "rewards/margins": 0.02397732436656952,
422
+ "rewards/rejected": -0.053143490105867386,
423
+ "step": 115
424
+ },
425
+ {
426
+ "epoch": 1.1377777777777778,
427
+ "grad_norm": 28.375,
428
+ "learning_rate": 4.6788055960981e-05,
429
+ "log_odds_chosen": 0.9922162294387817,
430
+ "log_odds_ratio": -0.40503960847854614,
431
+ "logits/chosen": -2.222867488861084,
432
+ "logits/rejected": -1.8629436492919922,
433
+ "logps/chosen": -0.5679124593734741,
434
+ "logps/rejected": -1.079012155532837,
435
+ "loss": 30.4758,
436
+ "nll_loss": 0.9042154550552368,
437
+ "rewards/accuracies": 0.8500000238418579,
438
+ "rewards/chosen": -0.028395622968673706,
439
+ "rewards/margins": 0.02555498108267784,
440
+ "rewards/rejected": -0.053950607776641846,
441
+ "step": 120
442
+ },
443
+ {
444
+ "epoch": 1.1851851851851851,
445
+ "grad_norm": 27.625,
446
+ "learning_rate": 4.538289398470304e-05,
447
+ "log_odds_chosen": 0.9345417022705078,
448
+ "log_odds_ratio": -0.4391079545021057,
449
+ "logits/chosen": -2.159498453140259,
450
+ "logits/rejected": -1.9323101043701172,
451
+ "logps/chosen": -0.6106966733932495,
452
+ "logps/rejected": -1.0727020502090454,
453
+ "loss": 28.3606,
454
+ "nll_loss": 0.874626636505127,
455
+ "rewards/accuracies": 0.8125,
456
+ "rewards/chosen": -0.030534833669662476,
457
+ "rewards/margins": 0.023100275546312332,
458
+ "rewards/rejected": -0.05363510921597481,
459
+ "step": 125
460
+ },
461
+ {
462
+ "epoch": 1.2325925925925927,
463
+ "grad_norm": 25.75,
464
+ "learning_rate": 4.393035218603139e-05,
465
+ "log_odds_chosen": 0.7891913652420044,
466
+ "log_odds_ratio": -0.4749962389469147,
467
+ "logits/chosen": -2.1488184928894043,
468
+ "logits/rejected": -1.8161399364471436,
469
+ "logps/chosen": -0.6154332160949707,
470
+ "logps/rejected": -1.0038516521453857,
471
+ "loss": 29.5153,
472
+ "nll_loss": 0.9258828163146973,
473
+ "rewards/accuracies": 0.7749999761581421,
474
+ "rewards/chosen": -0.030771661549806595,
475
+ "rewards/margins": 0.019420918077230453,
476
+ "rewards/rejected": -0.05019258335232735,
477
+ "step": 130
478
+ },
479
+ {
480
+ "epoch": 1.28,
481
+ "grad_norm": 26.125,
482
+ "learning_rate": 4.243490444176123e-05,
483
+ "log_odds_chosen": 0.912939190864563,
484
+ "log_odds_ratio": -0.4465225338935852,
485
+ "logits/chosen": -2.023993492126465,
486
+ "logits/rejected": -1.8624738454818726,
487
+ "logps/chosen": -0.5600326061248779,
488
+ "logps/rejected": -1.0113087892532349,
489
+ "loss": 29.4182,
490
+ "nll_loss": 0.8695603609085083,
491
+ "rewards/accuracies": 0.793749988079071,
492
+ "rewards/chosen": -0.028001630678772926,
493
+ "rewards/margins": 0.022563805803656578,
494
+ "rewards/rejected": -0.0505654402077198,
495
+ "step": 135
496
+ },
497
+ {
498
+ "epoch": 1.3274074074074074,
499
+ "grad_norm": 24.875,
500
+ "learning_rate": 4.090115678041962e-05,
501
+ "log_odds_chosen": 0.8396116495132446,
502
+ "log_odds_ratio": -0.47167715430259705,
503
+ "logits/chosen": -1.9779258966445923,
504
+ "logits/rejected": -1.8630996942520142,
505
+ "logps/chosen": -0.6393855214118958,
506
+ "logps/rejected": -1.073813557624817,
507
+ "loss": 30.4644,
508
+ "nll_loss": 0.9436683654785156,
509
+ "rewards/accuracies": 0.793749988079071,
510
+ "rewards/chosen": -0.031969279050827026,
511
+ "rewards/margins": 0.021721404045820236,
512
+ "rewards/rejected": -0.053690679371356964,
513
+ "step": 140
514
+ },
515
+ {
516
+ "epoch": 1.374814814814815,
517
+ "grad_norm": 27.0,
518
+ "learning_rate": 3.9333833195545325e-05,
519
+ "log_odds_chosen": 0.8570321798324585,
520
+ "log_odds_ratio": -0.4445961117744446,
521
+ "logits/chosen": -2.178088903427124,
522
+ "logits/rejected": -1.742640495300293,
523
+ "logps/chosen": -0.6361523270606995,
524
+ "logps/rejected": -1.0977928638458252,
525
+ "loss": 29.9436,
526
+ "nll_loss": 0.9138515591621399,
527
+ "rewards/accuracies": 0.7875000238418579,
528
+ "rewards/chosen": -0.03180761635303497,
529
+ "rewards/margins": 0.023082025349140167,
530
+ "rewards/rejected": -0.05488964170217514,
531
+ "step": 145
532
+ },
533
+ {
534
+ "epoch": 1.4222222222222223,
535
+ "grad_norm": 28.75,
536
+ "learning_rate": 3.7737761095632374e-05,
537
+ "log_odds_chosen": 0.8333398699760437,
538
+ "log_odds_ratio": -0.4819715619087219,
539
+ "logits/chosen": -2.0540931224823,
540
+ "logits/rejected": -2.0431177616119385,
541
+ "logps/chosen": -0.5949512124061584,
542
+ "logps/rejected": -0.9885191917419434,
543
+ "loss": 29.3897,
544
+ "nll_loss": 0.8951548337936401,
545
+ "rewards/accuracies": 0.7749999761581421,
546
+ "rewards/chosen": -0.029747556895017624,
547
+ "rewards/margins": 0.019678404554724693,
548
+ "rewards/rejected": -0.049425967037677765,
549
+ "step": 150
550
+ },
551
+ {
552
+ "epoch": 1.4696296296296296,
553
+ "grad_norm": 28.5,
554
+ "learning_rate": 3.611785643555225e-05,
555
+ "log_odds_chosen": 0.8982599377632141,
556
+ "log_odds_ratio": -0.44926947355270386,
557
+ "logits/chosen": -2.2695367336273193,
558
+ "logits/rejected": -1.7630071640014648,
559
+ "logps/chosen": -0.6012392044067383,
560
+ "logps/rejected": -1.050581693649292,
561
+ "loss": 29.8918,
562
+ "nll_loss": 0.9108700752258301,
563
+ "rewards/accuracies": 0.78125,
564
+ "rewards/chosen": -0.030061960220336914,
565
+ "rewards/margins": 0.022467125207185745,
566
+ "rewards/rejected": -0.05252908915281296,
567
+ "step": 155
568
+ },
569
+ {
570
+ "epoch": 1.5170370370370372,
571
+ "grad_norm": 30.625,
572
+ "learning_rate": 3.44791085752502e-05,
573
+ "log_odds_chosen": 0.8571161031723022,
574
+ "log_odds_ratio": -0.4356165826320648,
575
+ "logits/chosen": -2.093273639678955,
576
+ "logits/rejected": -2.1345930099487305,
577
+ "logps/chosen": -0.6451684236526489,
578
+ "logps/rejected": -1.1003749370574951,
579
+ "loss": 30.6067,
580
+ "nll_loss": 0.9544156193733215,
581
+ "rewards/accuracies": 0.84375,
582
+ "rewards/chosen": -0.03225841745734215,
583
+ "rewards/margins": 0.02276032790541649,
584
+ "rewards/rejected": -0.055018745362758636,
585
+ "step": 160
586
+ },
587
+ {
588
+ "epoch": 1.5644444444444443,
589
+ "grad_norm": 25.0,
590
+ "learning_rate": 3.2826564912351544e-05,
591
+ "log_odds_chosen": 0.9319137334823608,
592
+ "log_odds_ratio": -0.43036922812461853,
593
+ "logits/chosen": -1.9680538177490234,
594
+ "logits/rejected": -2.1300835609436035,
595
+ "logps/chosen": -0.6165143251419067,
596
+ "logps/rejected": -1.106245994567871,
597
+ "loss": 29.5351,
598
+ "nll_loss": 0.8909838795661926,
599
+ "rewards/accuracies": 0.8125,
600
+ "rewards/chosen": -0.030825715512037277,
601
+ "rewards/margins": 0.024486582726240158,
602
+ "rewards/rejected": -0.055312298238277435,
603
+ "step": 165
604
+ },
605
+ {
606
+ "epoch": 1.6118518518518519,
607
+ "grad_norm": 27.0,
608
+ "learning_rate": 3.116531533601003e-05,
609
+ "log_odds_chosen": 1.0328500270843506,
610
+ "log_odds_ratio": -0.4052696228027344,
611
+ "logits/chosen": -2.1744275093078613,
612
+ "logits/rejected": -1.9558875560760498,
613
+ "logps/chosen": -0.587931752204895,
614
+ "logps/rejected": -1.127718448638916,
615
+ "loss": 29.1635,
616
+ "nll_loss": 0.9016565084457397,
617
+ "rewards/accuracies": 0.8500000238418579,
618
+ "rewards/chosen": -0.02939658798277378,
619
+ "rewards/margins": 0.02698933705687523,
620
+ "rewards/rejected": -0.05638592690229416,
621
+ "step": 170
622
+ },
623
+ {
624
+ "epoch": 1.6592592592592592,
625
+ "grad_norm": 29.875,
626
+ "learning_rate": 2.9500476549880848e-05,
627
+ "log_odds_chosen": 0.8978468179702759,
628
+ "log_odds_ratio": -0.45074257254600525,
629
+ "logits/chosen": -1.9579814672470093,
630
+ "logits/rejected": -1.6362476348876953,
631
+ "logps/chosen": -0.5840794444084167,
632
+ "logps/rejected": -1.0381691455841064,
633
+ "loss": 28.8906,
634
+ "nll_loss": 0.8974016308784485,
635
+ "rewards/accuracies": 0.800000011920929,
636
+ "rewards/chosen": -0.029203975573182106,
637
+ "rewards/margins": 0.022704491391777992,
638
+ "rewards/rejected": -0.0519084632396698,
639
+ "step": 175
640
+ },
641
+ {
642
+ "epoch": 1.7066666666666666,
643
+ "grad_norm": 31.75,
644
+ "learning_rate": 2.7837176312504037e-05,
645
+ "log_odds_chosen": 0.804090678691864,
646
+ "log_odds_ratio": -0.46772366762161255,
647
+ "logits/chosen": -1.7048003673553467,
648
+ "logits/rejected": -1.6802536249160767,
649
+ "logps/chosen": -0.6104881167411804,
650
+ "logps/rejected": -1.00138521194458,
651
+ "loss": 29.964,
652
+ "nll_loss": 0.9319046139717102,
653
+ "rewards/accuracies": 0.7875000238418579,
654
+ "rewards/chosen": -0.03052440844476223,
655
+ "rewards/margins": 0.019544851034879684,
656
+ "rewards/rejected": -0.050069261342287064,
657
+ "step": 180
658
+ },
659
+ {
660
+ "epoch": 1.7540740740740741,
661
+ "grad_norm": 24.25,
662
+ "learning_rate": 2.618053764363861e-05,
663
+ "log_odds_chosen": 0.9093812108039856,
664
+ "log_odds_ratio": -0.4248902201652527,
665
+ "logits/chosen": -2.1142642498016357,
666
+ "logits/rejected": -1.8935844898223877,
667
+ "logps/chosen": -0.5908008813858032,
668
+ "logps/rejected": -1.051133632659912,
669
+ "loss": 29.4169,
670
+ "nll_loss": 0.8831952810287476,
671
+ "rewards/accuracies": 0.831250011920929,
672
+ "rewards/chosen": -0.029540037736296654,
673
+ "rewards/margins": 0.023016640916466713,
674
+ "rewards/rejected": -0.052556682378053665,
675
+ "step": 185
676
+ },
677
+ {
678
+ "epoch": 1.8014814814814815,
679
+ "grad_norm": 26.5,
680
+ "learning_rate": 2.453566304519216e-05,
681
+ "log_odds_chosen": 0.9450467228889465,
682
+ "log_odds_ratio": -0.4345301687717438,
683
+ "logits/chosen": -2.1107406616210938,
684
+ "logits/rejected": -1.6903254985809326,
685
+ "logps/chosen": -0.6346092820167542,
686
+ "logps/rejected": -1.1103118658065796,
687
+ "loss": 30.3828,
688
+ "nll_loss": 0.9147791862487793,
689
+ "rewards/accuracies": 0.8374999761581421,
690
+ "rewards/chosen": -0.03173046559095383,
691
+ "rewards/margins": 0.02378513291478157,
692
+ "rewards/rejected": -0.0555155873298645,
693
+ "step": 190
694
+ },
695
+ {
696
+ "epoch": 1.8488888888888888,
697
+ "grad_norm": 30.5,
698
+ "learning_rate": 2.29076187853462e-05,
699
+ "log_odds_chosen": 0.9741055369377136,
700
+ "log_odds_ratio": -0.4274386465549469,
701
+ "logits/chosen": -2.0201268196105957,
702
+ "logits/rejected": -1.357716679573059,
703
+ "logps/chosen": -0.5966172218322754,
704
+ "logps/rejected": -1.1088060140609741,
705
+ "loss": 29.2191,
706
+ "nll_loss": 0.882198691368103,
707
+ "rewards/accuracies": 0.831250011920929,
708
+ "rewards/chosen": -0.02983086369931698,
709
+ "rewards/margins": 0.025609437376260757,
710
+ "rewards/rejected": -0.05544029921293259,
711
+ "step": 195
712
+ },
713
+ {
+ "epoch": 1.8962962962962964,
+ "grad_norm": 27.625,
+ "learning_rate": 2.130141929428254e-05,
+ "log_odds_chosen": 0.8030783534049988,
+ "log_odds_ratio": -0.4836719036102295,
+ "logits/chosen": -2.072028875350952,
+ "logits/rejected": -1.7954127788543701,
+ "logps/chosen": -0.6211769580841064,
+ "logps/rejected": -1.0174511671066284,
+ "loss": 31.2309,
+ "nll_loss": 0.9159282445907593,
+ "rewards/accuracies": 0.7562500238418579,
+ "rewards/chosen": -0.031058847904205322,
+ "rewards/margins": 0.01981370709836483,
+ "rewards/rejected": -0.050872553139925,
+ "step": 200
+ },
+ {
+ "epoch": 1.9437037037037037,
+ "grad_norm": 25.5,
+ "learning_rate": 1.9722011719572444e-05,
+ "log_odds_chosen": 0.8332887887954712,
+ "log_odds_ratio": -0.46110400557518005,
+ "logits/chosen": -2.209178924560547,
+ "logits/rejected": -1.4569910764694214,
+ "logps/chosen": -0.614986777305603,
+ "logps/rejected": -1.0477509498596191,
+ "loss": 28.0719,
+ "nll_loss": 0.866260826587677,
+ "rewards/accuracies": 0.800000011920929,
+ "rewards/chosen": -0.03074934147298336,
+ "rewards/margins": 0.021638209000229836,
+ "rewards/rejected": -0.0523875467479229,
+ "step": 205
+ },
+ {
+ "epoch": 1.991111111111111,
+ "grad_norm": 27.375,
+ "learning_rate": 1.8174260688798445e-05,
+ "log_odds_chosen": 0.7869575619697571,
+ "log_odds_ratio": -0.4784061312675476,
+ "logits/chosen": -1.8811867237091064,
+ "logits/rejected": -2.0677828788757324,
+ "logps/chosen": -0.5915592908859253,
+ "logps/rejected": -0.9474382400512695,
+ "loss": 28.0122,
+ "nll_loss": 0.8690091967582703,
+ "rewards/accuracies": 0.800000011920929,
+ "rewards/chosen": -0.029577964916825294,
+ "rewards/margins": 0.017793944105505943,
+ "rewards/rejected": -0.04737190902233124,
+ "step": 210
+ },
+ {
+ "epoch": 2.0385185185185186,
+ "grad_norm": 23.75,
+ "learning_rate": 1.666293332634042e-05,
+ "log_odds_chosen": 1.401928186416626,
+ "log_odds_ratio": -0.33335039019584656,
+ "logits/chosen": -1.884316086769104,
+ "logits/rejected": -1.4572067260742188,
+ "logps/chosen": -0.4979814887046814,
+ "logps/rejected": -1.1203409433364868,
+ "loss": 24.9695,
+ "nll_loss": 0.7534885406494141,
+ "rewards/accuracies": 0.887499988079071,
+ "rewards/chosen": -0.02489907667040825,
+ "rewards/margins": 0.03111797571182251,
+ "rewards/rejected": -0.05601705238223076,
+ "step": 215
+ },
+ {
+ "epoch": 2.0859259259259257,
+ "grad_norm": 38.5,
+ "learning_rate": 1.519268457047482e-05,
+ "log_odds_chosen": 1.6839864253997803,
+ "log_odds_ratio": -0.2904171049594879,
+ "logits/chosen": -1.8016027212142944,
+ "logits/rejected": -1.8718116283416748,
+ "logps/chosen": -0.4482901096343994,
+ "logps/rejected": -1.188301682472229,
+ "loss": 23.232,
+ "nll_loss": 0.7302739024162292,
+ "rewards/accuracies": 0.887499988079071,
+ "rewards/chosen": -0.02241450548171997,
+ "rewards/margins": 0.03700058162212372,
+ "rewards/rejected": -0.05941509082913399,
+ "step": 220
+ },
+ {
+ "epoch": 2.1333333333333333,
+ "grad_norm": 27.5,
+ "learning_rate": 1.3768042836010768e-05,
+ "log_odds_chosen": 1.6373440027236938,
+ "log_odds_ratio": -0.2984515130519867,
+ "logits/chosen": -1.8258718252182007,
+ "logits/rejected": -1.6223289966583252,
+ "logps/chosen": -0.44031524658203125,
+ "logps/rejected": -1.1581284999847412,
+ "loss": 24.139,
+ "nll_loss": 0.7237830758094788,
+ "rewards/accuracies": 0.90625,
+ "rewards/chosen": -0.022015761584043503,
+ "rewards/margins": 0.035890672355890274,
+ "rewards/rejected": -0.05790643021464348,
+ "step": 225
+ },
+ {
+ "epoch": 2.180740740740741,
+ "grad_norm": 27.5,
+ "learning_rate": 1.239339606662261e-05,
+ "log_odds_chosen": 1.801990270614624,
+ "log_odds_ratio": -0.25359493494033813,
+ "logits/chosen": -1.93035089969635,
+ "logits/rejected": -1.6016725301742554,
+ "logps/chosen": -0.4278429448604584,
+ "logps/rejected": -1.2159802913665771,
+ "loss": 22.8267,
+ "nll_loss": 0.7034687995910645,
+ "rewards/accuracies": 0.9312499761581421,
+ "rewards/chosen": -0.02139214798808098,
+ "rewards/margins": 0.039406873285770416,
+ "rewards/rejected": -0.060799021273851395,
+ "step": 230
+ },
+ {
+ "epoch": 2.228148148148148,
+ "grad_norm": 34.5,
+ "learning_rate": 1.1072978219838283e-05,
+ "log_odds_chosen": 1.565932035446167,
+ "log_odds_ratio": -0.3256310820579529,
+ "logits/chosen": -1.9141308069229126,
+ "logits/rejected": -1.976017951965332,
+ "logps/chosen": -0.4726741313934326,
+ "logps/rejected": -1.1302521228790283,
+ "loss": 23.1599,
+ "nll_loss": 0.7224346399307251,
+ "rewards/accuracies": 0.862500011920929,
+ "rewards/chosen": -0.02363370731472969,
+ "rewards/margins": 0.03287890553474426,
+ "rewards/rejected": -0.05651261284947395,
+ "step": 235
+ },
+ {
+ "epoch": 2.2755555555555556,
+ "grad_norm": 32.25,
+ "learning_rate": 9.810856226309972e-06,
+ "log_odds_chosen": 1.7595332860946655,
+ "log_odds_ratio": -0.2692697048187256,
+ "logits/chosen": -1.8822906017303467,
+ "logits/rejected": -1.698127031326294,
+ "logps/chosen": -0.430245578289032,
+ "logps/rejected": -1.2083370685577393,
+ "loss": 23.1595,
+ "nll_loss": 0.7202972173690796,
+ "rewards/accuracies": 0.9312499761581421,
+ "rewards/chosen": -0.02151227928698063,
+ "rewards/margins": 0.0389045774936676,
+ "rewards/rejected": -0.06041685491800308,
+ "step": 240
+ },
+ {
+ "epoch": 2.322962962962963,
+ "grad_norm": 26.5,
+ "learning_rate": 8.61091746353324e-06,
+ "log_odds_chosen": 1.702959418296814,
+ "log_odds_ratio": -0.2750469446182251,
+ "logits/chosen": -2.1500630378723145,
+ "logits/rejected": -1.591073751449585,
+ "logps/chosen": -0.4450320601463318,
+ "logps/rejected": -1.1653035879135132,
+ "loss": 23.0947,
+ "nll_loss": 0.7247873544692993,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": -0.022251605987548828,
+ "rewards/margins": 0.03601358085870743,
+ "rewards/rejected": -0.058265186846256256,
+ "step": 245
+ },
+ {
+ "epoch": 2.3703703703703702,
+ "grad_norm": 29.125,
+ "learning_rate": 7.47685778259568e-06,
+ "log_odds_chosen": 1.729418158531189,
+ "log_odds_ratio": -0.25526946783065796,
+ "logits/chosen": -1.865269660949707,
+ "logits/rejected": -1.8307578563690186,
+ "logps/chosen": -0.43531733751296997,
+ "logps/rejected": -1.197790503501892,
+ "loss": 22.4449,
+ "nll_loss": 0.6787145733833313,
+ "rewards/accuracies": 0.9375,
+ "rewards/chosen": -0.02176586538553238,
+ "rewards/margins": 0.038123659789562225,
+ "rewards/rejected": -0.059889525175094604,
+ "step": 250
+ },
+ {
+ "epoch": 2.417777777777778,
+ "grad_norm": 27.125,
+ "learning_rate": 6.4121701248332905e-06,
+ "log_odds_chosen": 1.894997000694275,
+ "log_odds_ratio": -0.2565176784992218,
+ "logits/chosen": -1.9798578023910522,
+ "logits/rejected": -1.3841679096221924,
+ "logps/chosen": -0.3930845260620117,
+ "logps/rejected": -1.2068579196929932,
+ "loss": 22.3353,
+ "nll_loss": 0.6793208122253418,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": -0.019654225558042526,
+ "rewards/margins": 0.04068866744637489,
+ "rewards/rejected": -0.06034289672970772,
+ "step": 255
+ },
+ {
+ "epoch": 2.4651851851851854,
+ "grad_norm": 29.625,
+ "learning_rate": 5.420133763455645e-06,
+ "log_odds_chosen": 1.909266710281372,
+ "log_odds_ratio": -0.25379735231399536,
+ "logits/chosen": -1.9765899181365967,
+ "logits/rejected": -1.7865279912948608,
+ "logps/chosen": -0.4143601059913635,
+ "logps/rejected": -1.225185751914978,
+ "loss": 22.3829,
+ "nll_loss": 0.6900944709777832,
+ "rewards/accuracies": 0.90625,
+ "rewards/chosen": -0.020718006417155266,
+ "rewards/margins": 0.04054127633571625,
+ "rewards/rejected": -0.06125928834080696,
+ "step": 260
+ },
+ {
+ "epoch": 2.5125925925925925,
+ "grad_norm": 32.25,
+ "learning_rate": 4.503804203275866e-06,
+ "log_odds_chosen": 1.7796869277954102,
+ "log_odds_ratio": -0.30406466126441956,
+ "logits/chosen": -1.8215770721435547,
+ "logits/rejected": -1.862717866897583,
+ "logps/chosen": -0.4358927607536316,
+ "logps/rejected": -1.197788953781128,
+ "loss": 22.2978,
+ "nll_loss": 0.6913371086120605,
+ "rewards/accuracies": 0.8999999761581421,
+ "rewards/chosen": -0.0217946395277977,
+ "rewards/margins": 0.03809480741620064,
+ "rewards/rejected": -0.05988944694399834,
+ "step": 265
+ },
+ {
+ "epoch": 2.56,
+ "grad_norm": 29.0,
+ "learning_rate": 3.6660037696547376e-06,
+ "log_odds_chosen": 1.7562096118927002,
+ "log_odds_ratio": -0.25910764932632446,
+ "logits/chosen": -2.1091978549957275,
+ "logits/rejected": -1.8953710794448853,
+ "logps/chosen": -0.4530642628669739,
+ "logps/rejected": -1.2265712022781372,
+ "loss": 23.2665,
+ "nll_loss": 0.7341790199279785,
+ "rewards/accuracies": 0.9375,
+ "rewards/chosen": -0.022653216496109962,
+ "rewards/margins": 0.03867534175515175,
+ "rewards/rejected": -0.06132856011390686,
+ "step": 270
+ },
+ {
+ "epoch": 2.6074074074074076,
+ "grad_norm": 30.875,
+ "learning_rate": 2.909312915645238e-06,
+ "log_odds_chosen": 1.7647335529327393,
+ "log_odds_ratio": -0.28405773639678955,
+ "logits/chosen": -2.033613681793213,
+ "logits/rejected": -1.289603590965271,
+ "logps/chosen": -0.4545009732246399,
+ "logps/rejected": -1.2115771770477295,
+ "loss": 23.1922,
+ "nll_loss": 0.7150126695632935,
+ "rewards/accuracies": 0.918749988079071,
+ "rewards/chosen": -0.022725049406290054,
+ "rewards/margins": 0.03785381466150284,
+ "rewards/rejected": -0.060578860342502594,
+ "step": 275
+ },
+ {
+ "epoch": 2.6548148148148147,
+ "grad_norm": 34.75,
+ "learning_rate": 2.236062274111741e-06,
+ "log_odds_chosen": 1.6408923864364624,
+ "log_odds_ratio": -0.2776089906692505,
+ "logits/chosen": -1.8170640468597412,
+ "logits/rejected": -1.9727897644042969,
+ "logps/chosen": -0.4261111319065094,
+ "logps/rejected": -1.1319156885147095,
+ "loss": 22.1786,
+ "nll_loss": 0.6713584661483765,
+ "rewards/accuracies": 0.956250011920929,
+ "rewards/chosen": -0.02130555734038353,
+ "rewards/margins": 0.035290226340293884,
+ "rewards/rejected": -0.056595779955387115,
+ "step": 280
+ },
+ {
+ "epoch": 2.7022222222222223,
+ "grad_norm": 30.75,
+ "learning_rate": 1.648325479303684e-06,
+ "log_odds_chosen": 1.6113086938858032,
+ "log_odds_ratio": -0.2935238778591156,
+ "logits/chosen": -2.0707173347473145,
+ "logits/rejected": -1.4977672100067139,
+ "logps/chosen": -0.4346179962158203,
+ "logps/rejected": -1.1482003927230835,
+ "loss": 23.1607,
+ "nll_loss": 0.6985403895378113,
+ "rewards/accuracies": 0.918749988079071,
+ "rewards/chosen": -0.021730897948145866,
+ "rewards/margins": 0.03567912429571152,
+ "rewards/rejected": -0.05741002410650253,
+ "step": 285
+ },
+ {
+ "epoch": 2.74962962962963,
+ "grad_norm": 29.75,
+ "learning_rate": 1.1479127799935029e-06,
+ "log_odds_chosen": 1.8265297412872314,
+ "log_odds_ratio": -0.2631281614303589,
+ "logits/chosen": -1.841491937637329,
+ "logits/rejected": -1.922586441040039,
+ "logps/chosen": -0.4327624440193176,
+ "logps/rejected": -1.2341258525848389,
+ "loss": 22.9795,
+ "nll_loss": 0.7204877734184265,
+ "rewards/accuracies": 0.9375,
+ "rewards/chosen": -0.021638119593262672,
+ "rewards/margins": 0.04006817191839218,
+ "rewards/rejected": -0.0617062933743,
+ "step": 290
+ },
+ {
+ "epoch": 2.797037037037037,
+ "grad_norm": 31.625,
+ "learning_rate": 7.363654638505046e-07,
+ "log_odds_chosen": 1.7081098556518555,
+ "log_odds_ratio": -0.29199516773223877,
+ "logits/chosen": -1.7930389642715454,
+ "logits/rejected": -1.7181438207626343,
+ "logps/chosen": -0.449666827917099,
+ "logps/rejected": -1.2206642627716064,
+ "loss": 22.9709,
+ "nll_loss": 0.7106753587722778,
+ "rewards/accuracies": 0.9125000238418579,
+ "rewards/chosen": -0.0224833432585001,
+ "rewards/margins": 0.03854987770318985,
+ "rewards/rejected": -0.0610332190990448,
+ "step": 295
+ },
+ {
+ "epoch": 2.8444444444444446,
+ "grad_norm": 31.25,
+ "learning_rate": 4.149511102238568e-07,
+ "log_odds_chosen": 1.5754259824752808,
+ "log_odds_ratio": -0.3034347891807556,
+ "logits/chosen": -2.2559409141540527,
+ "logits/rejected": -1.71217942237854,
+ "logps/chosen": -0.46836423873901367,
+ "logps/rejected": -1.2223981618881226,
+ "loss": 22.8601,
+ "nll_loss": 0.7257949113845825,
+ "rewards/accuracies": 0.8999999761581421,
+ "rewards/chosen": -0.023418214172124863,
+ "rewards/margins": 0.037701696157455444,
+ "rewards/rejected": -0.06111990660429001,
+ "step": 300
+ },
+ {
+ "epoch": 2.891851851851852,
+ "grad_norm": 28.625,
+ "learning_rate": 1.8465968595625105e-07,
+ "log_odds_chosen": 1.6639511585235596,
+ "log_odds_ratio": -0.2808656096458435,
+ "logits/chosen": -2.1249499320983887,
+ "logits/rejected": -1.6745857000350952,
+ "logps/chosen": -0.475193589925766,
+ "logps/rejected": -1.1988952159881592,
+ "loss": 21.8942,
+ "nll_loss": 0.6743995547294617,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": -0.02375968173146248,
+ "rewards/margins": 0.03618507459759712,
+ "rewards/rejected": -0.0599447600543499,
+ "step": 305
+ },
+ {
+ "epoch": 2.9392592592592592,
+ "grad_norm": 30.75,
+ "learning_rate": 4.620049625329803e-08,
+ "log_odds_chosen": 1.7966537475585938,
+ "log_odds_ratio": -0.25377795100212097,
+ "logits/chosen": -1.9389528036117554,
+ "logits/rejected": -1.399864912033081,
+ "logps/chosen": -0.4379648268222809,
+ "logps/rejected": -1.1957590579986572,
+ "loss": 22.7814,
+ "nll_loss": 0.6846402883529663,
+ "rewards/accuracies": 0.9437500238418579,
+ "rewards/chosen": -0.021898243576288223,
+ "rewards/margins": 0.03788971155881882,
+ "rewards/rejected": -0.05978795886039734,
+ "step": 310
+ },
+ {
+ "epoch": 2.986666666666667,
+ "grad_norm": 32.75,
+ "learning_rate": 0.0,
+ "log_odds_chosen": 1.8841956853866577,
+ "log_odds_ratio": -0.24858447909355164,
+ "logits/chosen": -1.8983337879180908,
+ "logits/rejected": -1.5074989795684814,
+ "logps/chosen": -0.40289902687072754,
+ "logps/rejected": -1.209084391593933,
+ "loss": 22.233,
+ "nll_loss": 0.6980301737785339,
+ "rewards/accuracies": 0.956250011920929,
+ "rewards/chosen": -0.020144950598478317,
+ "rewards/margins": 0.040309272706508636,
+ "rewards/rejected": -0.06045422703027725,
+ "step": 315
+ },
+ {
+ "epoch": 2.986666666666667,
+ "step": 315,
+ "total_flos": 0.0,
+ "train_loss": 31.135096304757255,
+ "train_runtime": 6745.6063,
+ "train_samples_per_second": 3.002,
+ "train_steps_per_second": 0.047
+ }
+ ],
+ "logging_steps": 5,
+ "max_steps": 315,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 3,
+ "save_steps": 100000,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 1,
+ "trial_name": null,
+ "trial_params": null
+ }