[2024-07-27 07:12:22,627][Main][INFO] - Distributed environment: NO Num processes: 1 Process index: 0 Local process index: 0 Device: cuda Mixed precision type: bf16 [2024-07-27 07:12:22,627][Main][INFO] - Working directory is /workspace/nanoT5/logs/2024-07-27/07-12-22- [2024-07-27 07:12:26,801][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00774-of-01024.json.gz [2024-07-27 07:12:26,801][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00314-of-01024.json.gz [2024-07-27 07:12:26,801][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00780-of-01024.json.gz [2024-07-27 07:12:26,801][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00831-of-01024.json.gz [2024-07-27 07:12:26,802][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00113-of-01024.json.gz [2024-07-27 07:12:26,802][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00805-of-01024.json.gz [2024-07-27 07:12:26,802][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00587-of-01024.json.gz [2024-07-27 07:12:26,802][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-train.00854-of-01024.json.gz [2024-07-27 07:14:00,798][Main][INFO] - [train] Step 100 out of 65536 | Loss --> 58.651 | Grad_l2 --> 98.412 | Weights_l2 --> 10016.240 | Lr --> 0.010 | Seconds_per_step --> 0.942 | [2024-07-27 07:14:34,181][Main][INFO] - [train] Step 200 out of 65536 | Loss --> 11.804 | Grad_l2 --> 10.229 | Weights_l2 --> 10015.650 | Lr --> 0.010 | Seconds_per_step --> 0.334 | [2024-07-27 07:15:07,762][Main][INFO] - [train] Step 300 out of 65536 | Loss --> 8.260 | Grad_l2 --> 5.749 | Weights_l2 --> 10015.415 | Lr --> 0.010 | Seconds_per_step --> 0.336 | [2024-07-27 07:15:41,166][Main][INFO] - [train] Step 400 out of 65536 | Loss --> 7.378 | Grad_l2 --> 4.103 | Weights_l2 --> 10015.222 | Lr --> 0.010 | Seconds_per_step --> 0.334 | [2024-07-27 07:16:15,144][Main][INFO] - [train] Step 500 out of 65536 | Loss --> 6.789 | Grad_l2 --> 2.741 | Weights_l2 --> 10015.188 | Lr --> 0.011 | Seconds_per_step --> 0.340 | [2024-07-27 07:16:48,555][Main][INFO] - [train] Step 600 out of 65536 | Loss --> 6.449 | Grad_l2 --> 2.457 | Weights_l2 --> 10015.376 | Lr --> 0.011 | Seconds_per_step --> 0.334 | [2024-07-27 07:17:23,186][Main][INFO] - [train] Step 700 out of 65536 | Loss --> 6.282 | Grad_l2 --> 1.699 | Weights_l2 --> 10016.116 | Lr --> 0.011 | Seconds_per_step --> 0.346 | [2024-07-27 07:17:58,339][Main][INFO] - [train] Step 800 out of 65536 | Loss --> 6.181 | Grad_l2 --> 1.662 | Weights_l2 --> 10017.423 | Lr --> 0.011 | Seconds_per_step --> 0.352 | [2024-07-27 07:18:31,872][Main][INFO] - [train] Step 900 out of 65536 | Loss --> 6.067 | Grad_l2 --> 1.446 | Weights_l2 --> 10019.284 | Lr --> 0.011 | Seconds_per_step --> 0.335 | [2024-07-27 07:19:05,432][Main][INFO] - [train] Step 1000 out of 65536 | Loss --> 6.005 | Grad_l2 --> 1.281 | Weights_l2 --> 10021.594 | Lr --> 0.011 | Seconds_per_step --> 0.336 | [2024-07-27 07:19:39,253][Main][INFO] - [train] Step 1100 out of 65536 | Loss --> 5.948 | Grad_l2 --> 1.495 | Weights_l2 --> 10024.418 | Lr --> 0.011 | Seconds_per_step --> 0.338 | [2024-07-27 07:20:12,777][Main][INFO] - [train] Step 1200 out of 65536 | Loss --> 5.879 | Grad_l2 --> 1.082 | Weights_l2 --> 10027.518 | Lr --> 0.011 | Seconds_per_step --> 0.335 | [2024-07-27 07:20:46,294][Main][INFO] - [train] Step 1300 out of 65536 | Loss --> 5.828 | Grad_l2 --> 0.999 | Weights_l2 --> 10030.975 | Lr --> 0.011 | Seconds_per_step --> 0.335 | [2024-07-27 07:21:19,803][Main][INFO] - [train] Step 1400 out of 65536 | Loss --> 5.791 | Grad_l2 --> 1.041 | Weights_l2 --> 10034.530 | Lr --> 0.011 | Seconds_per_step --> 0.335 | [2024-07-27 07:21:53,088][Main][INFO] - [train] Step 1500 out of 65536 | Loss --> 5.748 | Grad_l2 --> 0.959 | Weights_l2 --> 10038.048 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 07:22:26,371][Main][INFO] - [train] Step 1600 out of 65536 | Loss --> 5.695 | Grad_l2 --> 0.910 | Weights_l2 --> 10041.782 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 07:22:59,963][Main][INFO] - [train] Step 1700 out of 65536 | Loss --> 5.666 | Grad_l2 --> 0.860 | Weights_l2 --> 10045.583 | Lr --> 0.012 | Seconds_per_step --> 0.336 | [2024-07-27 07:23:33,278][Main][INFO] - [train] Step 1800 out of 65536 | Loss --> 5.636 | Grad_l2 --> 0.851 | Weights_l2 --> 10049.546 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 07:24:07,082][Main][INFO] - [train] Step 1900 out of 65536 | Loss --> 5.608 | Grad_l2 --> 0.839 | Weights_l2 --> 10053.562 | Lr --> 0.012 | Seconds_per_step --> 0.338 | [2024-07-27 07:24:41,007][Main][INFO] - [train] Step 2000 out of 65536 | Loss --> 5.591 | Grad_l2 --> 0.853 | Weights_l2 --> 10057.785 | Lr --> 0.012 | Seconds_per_step --> 0.339 | [2024-07-27 07:25:14,236][Main][INFO] - [train] Step 2100 out of 65536 | Loss --> 5.572 | Grad_l2 --> 0.818 | Weights_l2 --> 10062.304 | Lr --> 0.012 | Seconds_per_step --> 0.332 | [2024-07-27 07:25:47,469][Main][INFO] - [train] Step 2200 out of 65536 | Loss --> 5.540 | Grad_l2 --> 0.754 | Weights_l2 --> 10067.057 | Lr --> 0.012 | Seconds_per_step --> 0.332 | [2024-07-27 07:26:21,707][Main][INFO] - [train] Step 2300 out of 65536 | Loss --> 5.530 | Grad_l2 --> 0.723 | Weights_l2 --> 10072.157 | Lr --> 0.012 | Seconds_per_step --> 0.342 | [2024-07-27 07:26:56,079][Main][INFO] - [train] Step 2400 out of 65536 | Loss --> 5.511 | Grad_l2 --> 0.736 | Weights_l2 --> 10077.321 | Lr --> 0.012 | Seconds_per_step --> 0.344 | [2024-07-27 07:27:29,428][Main][INFO] - [train] Step 2500 out of 65536 | Loss --> 5.481 | Grad_l2 --> 0.698 | Weights_l2 --> 10082.647 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 07:27:29,429][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-2500 [2024-07-27 07:27:29,431][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 07:27:30,885][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-2500/model.safetensors [2024-07-27 07:27:32,588][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-2500/optimizer.bin [2024-07-27 07:27:32,588][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-2500/scheduler.bin [2024-07-27 07:27:32,589][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-2500/sampler.bin [2024-07-27 07:27:32,589][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-2500/sampler_1.bin [2024-07-27 07:27:32,589][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-2500/random_states_0.pkl [2024-07-27 07:28:07,095][Main][INFO] - [train] Step 2600 out of 65536 | Loss --> 5.469 | Grad_l2 --> 0.693 | Weights_l2 --> 10088.241 | Lr --> 0.013 | Seconds_per_step --> 0.377 | [2024-07-27 07:28:40,488][Main][INFO] - [train] Step 2700 out of 65536 | Loss --> 5.453 | Grad_l2 --> 0.721 | Weights_l2 --> 10094.068 | Lr --> 0.013 | Seconds_per_step --> 0.334 | [2024-07-27 07:29:13,740][Main][INFO] - [train] Step 2800 out of 65536 | Loss --> 5.430 | Grad_l2 --> 0.662 | Weights_l2 --> 10100.075 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 07:29:47,481][Main][INFO] - [train] Step 2900 out of 65536 | Loss --> 5.427 | Grad_l2 --> 0.730 | Weights_l2 --> 10106.393 | Lr --> 0.013 | Seconds_per_step --> 0.337 | [2024-07-27 07:30:21,279][Main][INFO] - [train] Step 3000 out of 65536 | Loss --> 5.411 | Grad_l2 --> 0.682 | Weights_l2 --> 10113.085 | Lr --> 0.013 | Seconds_per_step --> 0.338 | [2024-07-27 07:30:54,544][Main][INFO] - [train] Step 3100 out of 65536 | Loss --> 5.379 | Grad_l2 --> 0.677 | Weights_l2 --> 10119.840 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 07:31:27,892][Main][INFO] - [train] Step 3200 out of 65536 | Loss --> 5.365 | Grad_l2 --> 0.684 | Weights_l2 --> 10127.146 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 07:32:01,425][Main][INFO] - [train] Step 3300 out of 65536 | Loss --> 5.338 | Grad_l2 --> 0.653 | Weights_l2 --> 10134.625 | Lr --> 0.013 | Seconds_per_step --> 0.335 | [2024-07-27 07:32:34,847][Main][INFO] - [train] Step 3400 out of 65536 | Loss --> 5.321 | Grad_l2 --> 0.648 | Weights_l2 --> 10142.228 | Lr --> 0.013 | Seconds_per_step --> 0.334 | [2024-07-27 07:33:08,177][Main][INFO] - [train] Step 3500 out of 65536 | Loss --> 5.288 | Grad_l2 --> 0.622 | Weights_l2 --> 10150.172 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 07:33:41,951][Main][INFO] - [train] Step 3600 out of 65536 | Loss --> 5.189 | Grad_l2 --> 0.699 | Weights_l2 --> 10158.811 | Lr --> 0.014 | Seconds_per_step --> 0.338 | [2024-07-27 07:34:16,216][Main][INFO] - [train] Step 3700 out of 65536 | Loss --> 5.029 | Grad_l2 --> 0.694 | Weights_l2 --> 10168.377 | Lr --> 0.014 | Seconds_per_step --> 0.343 | [2024-07-27 07:34:49,487][Main][INFO] - [train] Step 3800 out of 65536 | Loss --> 4.964 | Grad_l2 --> 0.656 | Weights_l2 --> 10178.324 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 07:35:23,223][Main][INFO] - [train] Step 3900 out of 65536 | Loss --> 4.899 | Grad_l2 --> 0.673 | Weights_l2 --> 10188.551 | Lr --> 0.014 | Seconds_per_step --> 0.337 | [2024-07-27 07:35:56,788][Main][INFO] - [train] Step 4000 out of 65536 | Loss --> 4.812 | Grad_l2 --> 0.626 | Weights_l2 --> 10199.091 | Lr --> 0.014 | Seconds_per_step --> 0.336 | [2024-07-27 07:36:30,291][Main][INFO] - [train] Step 4100 out of 65536 | Loss --> 4.743 | Grad_l2 --> 0.619 | Weights_l2 --> 10210.047 | Lr --> 0.014 | Seconds_per_step --> 0.335 | [2024-07-27 07:37:03,932][Main][INFO] - [train] Step 4200 out of 65536 | Loss --> 4.690 | Grad_l2 --> 0.616 | Weights_l2 --> 10221.586 | Lr --> 0.014 | Seconds_per_step --> 0.336 | [2024-07-27 07:37:37,261][Main][INFO] - [train] Step 4300 out of 65536 | Loss --> 4.630 | Grad_l2 --> 0.625 | Weights_l2 --> 10233.226 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 07:38:10,554][Main][INFO] - [train] Step 4400 out of 65536 | Loss --> 4.598 | Grad_l2 --> 0.583 | Weights_l2 --> 10245.163 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 07:38:43,862][Main][INFO] - [train] Step 4500 out of 65536 | Loss --> 4.563 | Grad_l2 --> 0.581 | Weights_l2 --> 10257.195 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 07:39:17,161][Main][INFO] - [train] Step 4600 out of 65536 | Loss --> 4.534 | Grad_l2 --> 0.627 | Weights_l2 --> 10269.378 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 07:39:50,476][Main][INFO] - [train] Step 4700 out of 65536 | Loss --> 4.496 | Grad_l2 --> 0.581 | Weights_l2 --> 10281.588 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 07:40:24,805][Main][INFO] - [train] Step 4800 out of 65536 | Loss --> 4.478 | Grad_l2 --> 0.592 | Weights_l2 --> 10294.103 | Lr --> 0.015 | Seconds_per_step --> 0.343 | [2024-07-27 07:40:59,172][Main][INFO] - [train] Step 4900 out of 65536 | Loss --> 4.445 | Grad_l2 --> 0.567 | Weights_l2 --> 10306.724 | Lr --> 0.015 | Seconds_per_step --> 0.344 | [2024-07-27 07:41:33,698][Main][INFO] - [train] Step 5000 out of 65536 | Loss --> 4.425 | Grad_l2 --> 0.580 | Weights_l2 --> 10319.588 | Lr --> 0.015 | Seconds_per_step --> 0.345 | [2024-07-27 07:41:33,698][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-5000 [2024-07-27 07:41:33,700][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 07:41:35,224][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-5000/model.safetensors [2024-07-27 07:41:36,979][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-5000/optimizer.bin [2024-07-27 07:41:36,980][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-5000/scheduler.bin [2024-07-27 07:41:36,980][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-5000/sampler.bin [2024-07-27 07:41:36,980][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-5000/sampler_1.bin [2024-07-27 07:41:36,981][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-5000/random_states_0.pkl [2024-07-27 07:42:11,344][Main][INFO] - [train] Step 5100 out of 65536 | Loss --> 4.391 | Grad_l2 --> 0.574 | Weights_l2 --> 10332.673 | Lr --> 0.015 | Seconds_per_step --> 0.376 | [2024-07-27 07:42:45,610][Main][INFO] - [train] Step 5200 out of 65536 | Loss --> 4.384 | Grad_l2 --> 0.599 | Weights_l2 --> 10346.074 | Lr --> 0.015 | Seconds_per_step --> 0.343 | [2024-07-27 07:43:19,142][Main][INFO] - [train] Step 5300 out of 65536 | Loss --> 4.346 | Grad_l2 --> 0.556 | Weights_l2 --> 10359.658 | Lr --> 0.015 | Seconds_per_step --> 0.335 | [2024-07-27 07:43:53,024][Main][INFO] - [train] Step 5400 out of 65536 | Loss --> 4.343 | Grad_l2 --> 0.548 | Weights_l2 --> 10373.877 | Lr --> 0.015 | Seconds_per_step --> 0.339 | [2024-07-27 07:44:26,545][Main][INFO] - [train] Step 5500 out of 65536 | Loss --> 4.300 | Grad_l2 --> 0.569 | Weights_l2 --> 10388.103 | Lr --> 0.016 | Seconds_per_step --> 0.335 | [2024-07-27 07:44:59,964][Main][INFO] - [train] Step 5600 out of 65536 | Loss --> 4.280 | Grad_l2 --> 0.568 | Weights_l2 --> 10402.629 | Lr --> 0.016 | Seconds_per_step --> 0.334 | [2024-07-27 07:45:33,259][Main][INFO] - [train] Step 5700 out of 65536 | Loss --> 4.272 | Grad_l2 --> 0.565 | Weights_l2 --> 10417.830 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 07:46:06,563][Main][INFO] - [train] Step 5800 out of 65536 | Loss --> 4.232 | Grad_l2 --> 0.550 | Weights_l2 --> 10433.360 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 07:46:39,844][Main][INFO] - [train] Step 5900 out of 65536 | Loss --> 4.207 | Grad_l2 --> 0.560 | Weights_l2 --> 10448.875 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 07:47:13,459][Main][INFO] - [train] Step 6000 out of 65536 | Loss --> 4.183 | Grad_l2 --> 0.552 | Weights_l2 --> 10464.983 | Lr --> 0.016 | Seconds_per_step --> 0.336 | [2024-07-27 07:47:48,385][Main][INFO] - [train] Step 6100 out of 65536 | Loss --> 4.159 | Grad_l2 --> 0.554 | Weights_l2 --> 10481.362 | Lr --> 0.016 | Seconds_per_step --> 0.349 | [2024-07-27 07:48:22,895][Main][INFO] - [train] Step 6200 out of 65536 | Loss --> 4.124 | Grad_l2 --> 0.520 | Weights_l2 --> 10498.185 | Lr --> 0.016 | Seconds_per_step --> 0.345 | [2024-07-27 07:48:57,430][Main][INFO] - [train] Step 6300 out of 65536 | Loss --> 4.109 | Grad_l2 --> 0.559 | Weights_l2 --> 10515.588 | Lr --> 0.016 | Seconds_per_step --> 0.345 | [2024-07-27 07:49:31,192][Main][INFO] - [train] Step 6400 out of 65536 | Loss --> 4.085 | Grad_l2 --> 0.534 | Weights_l2 --> 10532.990 | Lr --> 0.016 | Seconds_per_step --> 0.338 | [2024-07-27 07:50:04,452][Main][INFO] - [train] Step 6500 out of 65536 | Loss --> 4.071 | Grad_l2 --> 0.547 | Weights_l2 --> 10550.641 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 07:50:38,301][Main][INFO] - [train] Step 6600 out of 65536 | Loss --> 4.050 | Grad_l2 --> 0.548 | Weights_l2 --> 10568.688 | Lr --> 0.017 | Seconds_per_step --> 0.338 | [2024-07-27 07:51:11,754][Main][INFO] - [train] Step 6700 out of 65536 | Loss --> 4.017 | Grad_l2 --> 0.577 | Weights_l2 --> 10586.684 | Lr --> 0.017 | Seconds_per_step --> 0.335 | [2024-07-27 07:51:45,072][Main][INFO] - [train] Step 6800 out of 65536 | Loss --> 3.989 | Grad_l2 --> 0.515 | Weights_l2 --> 10604.471 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 07:52:19,589][Main][INFO] - [train] Step 6900 out of 65536 | Loss --> 3.987 | Grad_l2 --> 0.516 | Weights_l2 --> 10622.636 | Lr --> 0.017 | Seconds_per_step --> 0.345 | [2024-07-27 07:52:54,132][Main][INFO] - [train] Step 7000 out of 65536 | Loss --> 3.970 | Grad_l2 --> 0.517 | Weights_l2 --> 10640.896 | Lr --> 0.017 | Seconds_per_step --> 0.345 | [2024-07-27 07:53:28,146][Main][INFO] - [train] Step 7100 out of 65536 | Loss --> 3.944 | Grad_l2 --> 0.550 | Weights_l2 --> 10659.409 | Lr --> 0.017 | Seconds_per_step --> 0.340 | [2024-07-27 07:54:01,717][Main][INFO] - [train] Step 7200 out of 65536 | Loss --> 3.943 | Grad_l2 --> 0.569 | Weights_l2 --> 10678.709 | Lr --> 0.017 | Seconds_per_step --> 0.336 | [2024-07-27 07:54:35,013][Main][INFO] - [train] Step 7300 out of 65536 | Loss --> 3.916 | Grad_l2 --> 0.524 | Weights_l2 --> 10697.887 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 07:55:08,299][Main][INFO] - [train] Step 7400 out of 65536 | Loss --> 3.868 | Grad_l2 --> 0.500 | Weights_l2 --> 10716.871 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 07:55:41,578][Main][INFO] - [train] Step 7500 out of 65536 | Loss --> 3.878 | Grad_l2 --> 0.505 | Weights_l2 --> 10735.951 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 07:55:41,579][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-7500 [2024-07-27 07:55:41,581][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 07:55:43,020][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-7500/model.safetensors [2024-07-27 07:55:44,547][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-7500/optimizer.bin [2024-07-27 07:55:44,547][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-7500/scheduler.bin [2024-07-27 07:55:44,548][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-7500/sampler.bin [2024-07-27 07:55:44,548][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-7500/sampler_1.bin [2024-07-27 07:55:44,548][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-7500/random_states_0.pkl [2024-07-27 07:56:18,003][Main][INFO] - [train] Step 7600 out of 65536 | Loss --> 3.846 | Grad_l2 --> 0.499 | Weights_l2 --> 10754.944 | Lr --> 0.018 | Seconds_per_step --> 0.364 | [2024-07-27 07:56:55,517][Main][INFO] - [train] Step 7700 out of 65536 | Loss --> 3.842 | Grad_l2 --> 0.507 | Weights_l2 --> 10774.257 | Lr --> 0.018 | Seconds_per_step --> 0.375 | [2024-07-27 07:57:30,371][Main][INFO] - [train] Step 7800 out of 65536 | Loss --> 3.837 | Grad_l2 --> 0.491 | Weights_l2 --> 10794.147 | Lr --> 0.018 | Seconds_per_step --> 0.349 | [2024-07-27 07:58:04,985][Main][INFO] - [train] Step 7900 out of 65536 | Loss --> 3.795 | Grad_l2 --> 0.494 | Weights_l2 --> 10813.758 | Lr --> 0.018 | Seconds_per_step --> 0.346 | [2024-07-27 07:58:39,334][Main][INFO] - [train] Step 8000 out of 65536 | Loss --> 3.786 | Grad_l2 --> 0.509 | Weights_l2 --> 10833.650 | Lr --> 0.018 | Seconds_per_step --> 0.343 | [2024-07-27 07:59:13,980][Main][INFO] - [train] Step 8100 out of 65536 | Loss --> 3.780 | Grad_l2 --> 0.484 | Weights_l2 --> 10853.753 | Lr --> 0.018 | Seconds_per_step --> 0.346 | [2024-07-27 07:59:47,355][Main][INFO] - [train] Step 8200 out of 65536 | Loss --> 3.762 | Grad_l2 --> 0.474 | Weights_l2 --> 10873.724 | Lr --> 0.018 | Seconds_per_step --> 0.334 | [2024-07-27 08:00:20,624][Main][INFO] - [train] Step 8300 out of 65536 | Loss --> 3.738 | Grad_l2 --> 0.476 | Weights_l2 --> 10893.682 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 08:00:54,222][Main][INFO] - [train] Step 8400 out of 65536 | Loss --> 3.746 | Grad_l2 --> 0.485 | Weights_l2 --> 10914.090 | Lr --> 0.018 | Seconds_per_step --> 0.336 | [2024-07-27 08:01:27,509][Main][INFO] - [train] Step 8500 out of 65536 | Loss --> 3.719 | Grad_l2 --> 0.478 | Weights_l2 --> 10934.565 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:02:00,794][Main][INFO] - [train] Step 8600 out of 65536 | Loss --> 3.711 | Grad_l2 --> 0.487 | Weights_l2 --> 10955.518 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:02:34,071][Main][INFO] - [train] Step 8700 out of 65536 | Loss --> 3.706 | Grad_l2 --> 0.487 | Weights_l2 --> 10976.979 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:03:07,363][Main][INFO] - [train] Step 8800 out of 65536 | Loss --> 3.694 | Grad_l2 --> 0.459 | Weights_l2 --> 10997.958 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:03:40,644][Main][INFO] - [train] Step 8900 out of 65536 | Loss --> 3.669 | Grad_l2 --> 0.466 | Weights_l2 --> 11019.523 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:04:13,947][Main][INFO] - [train] Step 9000 out of 65536 | Loss --> 3.664 | Grad_l2 --> 0.458 | Weights_l2 --> 11041.078 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:04:47,544][Main][INFO] - [train] Step 9100 out of 65536 | Loss --> 3.648 | Grad_l2 --> 0.453 | Weights_l2 --> 11062.927 | Lr --> 0.019 | Seconds_per_step --> 0.336 | [2024-07-27 08:05:20,851][Main][INFO] - [train] Step 9200 out of 65536 | Loss --> 3.644 | Grad_l2 --> 0.445 | Weights_l2 --> 11084.693 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:05:54,140][Main][INFO] - [train] Step 9300 out of 65536 | Loss --> 3.635 | Grad_l2 --> 0.439 | Weights_l2 --> 11106.778 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:06:27,446][Main][INFO] - [train] Step 9400 out of 65536 | Loss --> 3.613 | Grad_l2 --> 0.460 | Weights_l2 --> 11129.217 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:07:00,747][Main][INFO] - [train] Step 9500 out of 65536 | Loss --> 3.600 | Grad_l2 --> 0.456 | Weights_l2 --> 11151.878 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:07:34,061][Main][INFO] - [train] Step 9600 out of 65536 | Loss --> 3.612 | Grad_l2 --> 0.452 | Weights_l2 --> 11174.948 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:08:07,648][Main][INFO] - [train] Step 9700 out of 65536 | Loss --> 3.599 | Grad_l2 --> 0.452 | Weights_l2 --> 11197.920 | Lr --> 0.020 | Seconds_per_step --> 0.336 | [2024-07-27 08:08:40,961][Main][INFO] - [train] Step 9800 out of 65536 | Loss --> 3.588 | Grad_l2 --> 0.462 | Weights_l2 --> 11221.197 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:09:14,257][Main][INFO] - [train] Step 9900 out of 65536 | Loss --> 3.581 | Grad_l2 --> 0.463 | Weights_l2 --> 11244.552 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:09:47,568][Main][INFO] - [train] Step 10000 out of 65536 | Loss --> 3.558 | Grad_l2 --> 0.436 | Weights_l2 --> 11267.926 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:09:47,569][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-10000 [2024-07-27 08:09:47,571][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 08:09:49,002][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-10000/model.safetensors [2024-07-27 08:09:50,627][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-10000/optimizer.bin [2024-07-27 08:09:50,628][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-10000/scheduler.bin [2024-07-27 08:09:50,628][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-10000/sampler.bin [2024-07-27 08:09:50,628][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-10000/sampler_1.bin [2024-07-27 08:09:50,628][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-10000/random_states_0.pkl [2024-07-27 08:10:26,125][Main][INFO] - [train] Step 10100 out of 65536 | Loss --> 3.552 | Grad_l2 --> 0.450 | Weights_l2 --> 11291.800 | Lr --> 0.020 | Seconds_per_step --> 0.386 | [2024-07-27 08:11:00,677][Main][INFO] - [train] Step 10200 out of 65536 | Loss --> 3.551 | Grad_l2 --> 0.432 | Weights_l2 --> 11315.670 | Lr --> 0.020 | Seconds_per_step --> 0.346 | [2024-07-27 08:11:35,588][Main][INFO] - [train] Step 10300 out of 65536 | Loss --> 3.532 | Grad_l2 --> 0.430 | Weights_l2 --> 11339.401 | Lr --> 0.020 | Seconds_per_step --> 0.349 | [2024-07-27 08:12:10,535][Main][INFO] - [train] Step 10400 out of 65536 | Loss --> 3.528 | Grad_l2 --> 0.454 | Weights_l2 --> 11363.218 | Lr --> 0.020 | Seconds_per_step --> 0.349 | [2024-07-27 08:12:43,957][Main][INFO] - [train] Step 10500 out of 65536 | Loss --> 3.516 | Grad_l2 --> 0.445 | Weights_l2 --> 11386.696 | Lr --> 0.020 | Seconds_per_step --> 0.334 | [2024-07-27 08:13:17,239][Main][INFO] - [train] Step 10600 out of 65536 | Loss --> 3.513 | Grad_l2 --> 0.430 | Weights_l2 --> 11410.171 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:13:50,543][Main][INFO] - [train] Step 10700 out of 65536 | Loss --> 3.495 | Grad_l2 --> 0.420 | Weights_l2 --> 11433.378 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:14:23,825][Main][INFO] - [train] Step 10800 out of 65536 | Loss --> 3.490 | Grad_l2 --> 0.443 | Weights_l2 --> 11457.010 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:14:57,411][Main][INFO] - [train] Step 10900 out of 65536 | Loss --> 3.482 | Grad_l2 --> 0.427 | Weights_l2 --> 11480.168 | Lr --> 0.020 | Seconds_per_step --> 0.336 | [2024-07-27 08:15:30,876][Main][INFO] - [train] Step 11000 out of 65536 | Loss --> 3.475 | Grad_l2 --> 0.411 | Weights_l2 --> 11503.757 | Lr --> 0.020 | Seconds_per_step --> 0.335 | [2024-07-27 08:16:05,551][Main][INFO] - [train] Step 11100 out of 65536 | Loss --> 3.471 | Grad_l2 --> 0.432 | Weights_l2 --> 11527.127 | Lr --> 0.020 | Seconds_per_step --> 0.347 | [2024-07-27 08:16:40,139][Main][INFO] - [train] Step 11200 out of 65536 | Loss --> 3.489 | Grad_l2 --> 0.447 | Weights_l2 --> 11550.928 | Lr --> 0.020 | Seconds_per_step --> 0.346 | [2024-07-27 08:17:15,674][Main][INFO] - [train] Step 11300 out of 65536 | Loss --> 3.454 | Grad_l2 --> 0.442 | Weights_l2 --> 11574.032 | Lr --> 0.020 | Seconds_per_step --> 0.355 | [2024-07-27 08:17:49,107][Main][INFO] - [train] Step 11400 out of 65536 | Loss --> 3.452 | Grad_l2 --> 0.452 | Weights_l2 --> 11597.980 | Lr --> 0.020 | Seconds_per_step --> 0.334 | [2024-07-27 08:18:23,253][Main][INFO] - [train] Step 11500 out of 65536 | Loss --> 3.448 | Grad_l2 --> 0.436 | Weights_l2 --> 11621.734 | Lr --> 0.020 | Seconds_per_step --> 0.341 | [2024-07-27 08:18:56,704][Main][INFO] - [train] Step 11600 out of 65536 | Loss --> 3.436 | Grad_l2 --> 0.413 | Weights_l2 --> 11644.755 | Lr --> 0.020 | Seconds_per_step --> 0.335 | [2024-07-27 08:19:30,293][Main][INFO] - [train] Step 11700 out of 65536 | Loss --> 3.429 | Grad_l2 --> 0.424 | Weights_l2 --> 11667.986 | Lr --> 0.020 | Seconds_per_step --> 0.336 | [2024-07-27 08:20:03,690][Main][INFO] - [train] Step 11800 out of 65536 | Loss --> 3.417 | Grad_l2 --> 0.427 | Weights_l2 --> 11691.306 | Lr --> 0.020 | Seconds_per_step --> 0.334 | [2024-07-27 08:20:36,970][Main][INFO] - [train] Step 11900 out of 65536 | Loss --> 3.406 | Grad_l2 --> 0.418 | Weights_l2 --> 11714.365 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:21:10,469][Main][INFO] - [train] Step 12000 out of 65536 | Loss --> 3.408 | Grad_l2 --> 0.439 | Weights_l2 --> 11737.743 | Lr --> 0.020 | Seconds_per_step --> 0.335 | [2024-07-27 08:21:44,061][Main][INFO] - [train] Step 12100 out of 65536 | Loss --> 3.404 | Grad_l2 --> 0.407 | Weights_l2 --> 11760.802 | Lr --> 0.020 | Seconds_per_step --> 0.336 | [2024-07-27 08:22:17,563][Main][INFO] - [train] Step 12200 out of 65536 | Loss --> 3.382 | Grad_l2 --> 0.410 | Weights_l2 --> 11783.958 | Lr --> 0.020 | Seconds_per_step --> 0.335 | [2024-07-27 08:22:51,081][Main][INFO] - [train] Step 12300 out of 65536 | Loss --> 3.377 | Grad_l2 --> 0.423 | Weights_l2 --> 11806.758 | Lr --> 0.020 | Seconds_per_step --> 0.335 | [2024-07-27 08:23:24,613][Main][INFO] - [train] Step 12400 out of 65536 | Loss --> 3.361 | Grad_l2 --> 0.405 | Weights_l2 --> 11829.366 | Lr --> 0.020 | Seconds_per_step --> 0.335 | [2024-07-27 08:23:59,081][Main][INFO] - [train] Step 12500 out of 65536 | Loss --> 3.369 | Grad_l2 --> 0.428 | Weights_l2 --> 11852.095 | Lr --> 0.020 | Seconds_per_step --> 0.345 | [2024-07-27 08:23:59,081][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-12500 [2024-07-27 08:23:59,084][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 08:24:00,677][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-12500/model.safetensors [2024-07-27 08:24:02,616][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-12500/optimizer.bin [2024-07-27 08:24:02,616][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-12500/scheduler.bin [2024-07-27 08:24:02,616][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-12500/sampler.bin [2024-07-27 08:24:02,616][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-12500/sampler_1.bin [2024-07-27 08:24:02,617][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-12500/random_states_0.pkl [2024-07-27 08:24:36,815][Main][INFO] - [train] Step 12600 out of 65536 | Loss --> 3.355 | Grad_l2 --> 0.403 | Weights_l2 --> 11875.041 | Lr --> 0.020 | Seconds_per_step --> 0.377 | [2024-07-27 08:25:10,423][Main][INFO] - [train] Step 12700 out of 65536 | Loss --> 3.355 | Grad_l2 --> 0.410 | Weights_l2 --> 11898.021 | Lr --> 0.020 | Seconds_per_step --> 0.336 | [2024-07-27 08:25:43,771][Main][INFO] - [train] Step 12800 out of 65536 | Loss --> 3.342 | Grad_l2 --> 0.405 | Weights_l2 --> 11920.759 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:26:17,131][Main][INFO] - [train] Step 12900 out of 65536 | Loss --> 3.336 | Grad_l2 --> 0.398 | Weights_l2 --> 11943.286 | Lr --> 0.020 | Seconds_per_step --> 0.334 | [2024-07-27 08:26:50,504][Main][INFO] - [train] Step 13000 out of 65536 | Loss --> 3.337 | Grad_l2 --> 0.426 | Weights_l2 --> 11966.051 | Lr --> 0.020 | Seconds_per_step --> 0.334 | [2024-07-27 08:27:23,867][Main][INFO] - [train] Step 13100 out of 65536 | Loss --> 3.336 | Grad_l2 --> 0.411 | Weights_l2 --> 11988.551 | Lr --> 0.020 | Seconds_per_step --> 0.334 | [2024-07-27 08:27:57,241][Main][INFO] - [train] Step 13200 out of 65536 | Loss --> 3.343 | Grad_l2 --> 0.505 | Weights_l2 --> 12011.940 | Lr --> 0.020 | Seconds_per_step --> 0.334 | [2024-07-27 08:28:31,401][Main][INFO] - [train] Step 13300 out of 65536 | Loss --> 3.348 | Grad_l2 --> 0.428 | Weights_l2 --> 12035.923 | Lr --> 0.020 | Seconds_per_step --> 0.342 | [2024-07-27 08:29:04,880][Main][INFO] - [train] Step 13400 out of 65536 | Loss --> 3.318 | Grad_l2 --> 0.408 | Weights_l2 --> 12058.360 | Lr --> 0.020 | Seconds_per_step --> 0.335 | [2024-07-27 08:29:38,321][Main][INFO] - [train] Step 13500 out of 65536 | Loss --> 3.306 | Grad_l2 --> 0.410 | Weights_l2 --> 12080.518 | Lr --> 0.020 | Seconds_per_step --> 0.334 | [2024-07-27 08:30:11,652][Main][INFO] - [train] Step 13600 out of 65536 | Loss --> 3.309 | Grad_l2 --> 0.405 | Weights_l2 --> 12102.873 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:30:44,927][Main][INFO] - [train] Step 13700 out of 65536 | Loss --> 3.293 | Grad_l2 --> 0.416 | Weights_l2 --> 12125.297 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:31:19,528][Main][INFO] - [train] Step 13800 out of 65536 | Loss --> 3.304 | Grad_l2 --> 0.416 | Weights_l2 --> 12147.473 | Lr --> 0.020 | Seconds_per_step --> 0.346 | [2024-07-27 08:31:53,099][Main][INFO] - [train] Step 13900 out of 65536 | Loss --> 3.289 | Grad_l2 --> 0.413 | Weights_l2 --> 12169.571 | Lr --> 0.020 | Seconds_per_step --> 0.336 | [2024-07-27 08:32:26,413][Main][INFO] - [train] Step 14000 out of 65536 | Loss --> 3.271 | Grad_l2 --> 0.415 | Weights_l2 --> 12191.813 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:32:59,693][Main][INFO] - [train] Step 14100 out of 65536 | Loss --> 3.276 | Grad_l2 --> 0.391 | Weights_l2 --> 12213.996 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:33:33,008][Main][INFO] - [train] Step 14200 out of 65536 | Loss --> 3.259 | Grad_l2 --> 0.396 | Weights_l2 --> 12236.230 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:34:06,281][Main][INFO] - [train] Step 14300 out of 65536 | Loss --> 3.269 | Grad_l2 --> 0.388 | Weights_l2 --> 12258.535 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:34:39,575][Main][INFO] - [train] Step 14400 out of 65536 | Loss --> 3.256 | Grad_l2 --> 0.387 | Weights_l2 --> 12280.683 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:35:13,178][Main][INFO] - [train] Step 14500 out of 65536 | Loss --> 3.256 | Grad_l2 --> 0.401 | Weights_l2 --> 12302.888 | Lr --> 0.020 | Seconds_per_step --> 0.336 | [2024-07-27 08:35:46,611][Main][INFO] - [train] Step 14600 out of 65536 | Loss --> 3.268 | Grad_l2 --> 0.412 | Weights_l2 --> 12325.040 | Lr --> 0.020 | Seconds_per_step --> 0.334 | [2024-07-27 08:36:20,541][Main][INFO] - [train] Step 14700 out of 65536 | Loss --> 3.246 | Grad_l2 --> 0.410 | Weights_l2 --> 12347.146 | Lr --> 0.020 | Seconds_per_step --> 0.339 | [2024-07-27 08:36:53,845][Main][INFO] - [train] Step 14800 out of 65536 | Loss --> 3.254 | Grad_l2 --> 0.388 | Weights_l2 --> 12369.487 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:37:27,130][Main][INFO] - [train] Step 14900 out of 65536 | Loss --> 3.244 | Grad_l2 --> 0.411 | Weights_l2 --> 12391.786 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:38:00,425][Main][INFO] - [train] Step 15000 out of 65536 | Loss --> 3.229 | Grad_l2 --> 0.423 | Weights_l2 --> 12413.940 | Lr --> 0.020 | Seconds_per_step --> 0.333 | [2024-07-27 08:38:00,426][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-15000 [2024-07-27 08:38:00,428][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 08:38:01,868][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-15000/model.safetensors [2024-07-27 08:38:03,397][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-15000/optimizer.bin [2024-07-27 08:38:03,398][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-15000/scheduler.bin [2024-07-27 08:38:03,398][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-15000/sampler.bin [2024-07-27 08:38:03,398][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-15000/sampler_1.bin [2024-07-27 08:38:03,398][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-15000/random_states_0.pkl [2024-07-27 08:38:37,273][Main][INFO] - [train] Step 15100 out of 65536 | Loss --> 3.244 | Grad_l2 --> 0.400 | Weights_l2 --> 12436.022 | Lr --> 0.020 | Seconds_per_step --> 0.368 | [2024-07-27 08:39:11,208][Main][INFO] - [train] Step 15200 out of 65536 | Loss --> 3.237 | Grad_l2 --> 0.404 | Weights_l2 --> 12458.123 | Lr --> 0.020 | Seconds_per_step --> 0.339 | [2024-07-27 08:39:44,455][Main][INFO] - [train] Step 15300 out of 65536 | Loss --> 3.230 | Grad_l2 --> 0.395 | Weights_l2 --> 12480.551 | Lr --> 0.020 | Seconds_per_step --> 0.332 | [2024-07-27 08:40:19,085][Main][INFO] - [train] Step 15400 out of 65536 | Loss --> 3.213 | Grad_l2 --> 0.385 | Weights_l2 --> 12502.622 | Lr --> 0.020 | Seconds_per_step --> 0.346 | [2024-07-27 08:40:53,941][Main][INFO] - [train] Step 15500 out of 65536 | Loss --> 3.226 | Grad_l2 --> 0.392 | Weights_l2 --> 12524.591 | Lr --> 0.020 | Seconds_per_step --> 0.349 | [2024-07-27 08:41:30,306][Main][INFO] - [train] Step 15600 out of 65536 | Loss --> 3.213 | Grad_l2 --> 0.392 | Weights_l2 --> 12546.455 | Lr --> 0.020 | Seconds_per_step --> 0.364 | [2024-07-27 08:42:04,136][Main][INFO] - [train] Step 15700 out of 65536 | Loss --> 3.205 | Grad_l2 --> 0.381 | Weights_l2 --> 12568.052 | Lr --> 0.019 | Seconds_per_step --> 0.338 | [2024-07-27 08:42:37,478][Main][INFO] - [train] Step 15800 out of 65536 | Loss --> 3.187 | Grad_l2 --> 0.386 | Weights_l2 --> 12589.610 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:43:10,820][Main][INFO] - [train] Step 15900 out of 65536 | Loss --> 3.190 | Grad_l2 --> 0.378 | Weights_l2 --> 12611.232 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:43:44,162][Main][INFO] - [train] Step 16000 out of 65536 | Loss --> 3.187 | Grad_l2 --> 0.413 | Weights_l2 --> 12633.117 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:44:17,526][Main][INFO] - [train] Step 16100 out of 65536 | Loss --> 3.198 | Grad_l2 --> 0.393 | Weights_l2 --> 12655.032 | Lr --> 0.019 | Seconds_per_step --> 0.334 | [2024-07-27 08:44:50,870][Main][INFO] - [train] Step 16200 out of 65536 | Loss --> 3.187 | Grad_l2 --> 0.390 | Weights_l2 --> 12676.258 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:45:24,228][Main][INFO] - [train] Step 16300 out of 65536 | Loss --> 3.194 | Grad_l2 --> 0.391 | Weights_l2 --> 12698.233 | Lr --> 0.019 | Seconds_per_step --> 0.334 | [2024-07-27 08:45:58,097][Main][INFO] - [train] Step 16400 out of 65536 | Loss --> 3.203 | Grad_l2 --> 0.467 | Weights_l2 --> 12722.017 | Lr --> 0.019 | Seconds_per_step --> 0.339 | [2024-07-27 08:46:31,641][Main][INFO] - [train] Step 16500 out of 65536 | Loss --> 3.182 | Grad_l2 --> 0.410 | Weights_l2 --> 12743.822 | Lr --> 0.019 | Seconds_per_step --> 0.335 | [2024-07-27 08:47:05,185][Main][INFO] - [train] Step 16600 out of 65536 | Loss --> 3.175 | Grad_l2 --> 0.389 | Weights_l2 --> 12765.289 | Lr --> 0.019 | Seconds_per_step --> 0.335 | [2024-07-27 08:47:38,599][Main][INFO] - [train] Step 16700 out of 65536 | Loss --> 3.168 | Grad_l2 --> 0.369 | Weights_l2 --> 12786.711 | Lr --> 0.019 | Seconds_per_step --> 0.334 | [2024-07-27 08:48:11,933][Main][INFO] - [train] Step 16800 out of 65536 | Loss --> 3.155 | Grad_l2 --> 0.390 | Weights_l2 --> 12807.939 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:48:45,260][Main][INFO] - [train] Step 16900 out of 65536 | Loss --> 3.143 | Grad_l2 --> 0.391 | Weights_l2 --> 12829.546 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:49:18,937][Main][INFO] - [train] Step 17000 out of 65536 | Loss --> 3.146 | Grad_l2 --> 0.387 | Weights_l2 --> 12850.919 | Lr --> 0.019 | Seconds_per_step --> 0.337 | [2024-07-27 08:49:52,258][Main][INFO] - [train] Step 17100 out of 65536 | Loss --> 3.156 | Grad_l2 --> 0.386 | Weights_l2 --> 12872.437 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:50:25,588][Main][INFO] - [train] Step 17200 out of 65536 | Loss --> 3.152 | Grad_l2 --> 0.387 | Weights_l2 --> 12893.716 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:50:58,909][Main][INFO] - [train] Step 17300 out of 65536 | Loss --> 3.125 | Grad_l2 --> 0.384 | Weights_l2 --> 12914.777 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:51:32,232][Main][INFO] - [train] Step 17400 out of 65536 | Loss --> 3.142 | Grad_l2 --> 0.386 | Weights_l2 --> 12935.562 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:52:05,569][Main][INFO] - [train] Step 17500 out of 65536 | Loss --> 3.127 | Grad_l2 --> 0.379 | Weights_l2 --> 12956.991 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:52:05,569][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-17500 [2024-07-27 08:52:05,571][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 08:52:07,013][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-17500/model.safetensors [2024-07-27 08:52:08,710][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-17500/optimizer.bin [2024-07-27 08:52:08,710][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-17500/scheduler.bin [2024-07-27 08:52:08,710][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-17500/sampler.bin [2024-07-27 08:52:08,710][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-17500/sampler_1.bin [2024-07-27 08:52:08,711][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-17500/random_states_0.pkl [2024-07-27 08:52:42,403][Main][INFO] - [train] Step 17600 out of 65536 | Loss --> 3.135 | Grad_l2 --> 0.384 | Weights_l2 --> 12977.871 | Lr --> 0.019 | Seconds_per_step --> 0.368 | [2024-07-27 08:53:15,733][Main][INFO] - [train] Step 17700 out of 65536 | Loss --> 3.140 | Grad_l2 --> 0.388 | Weights_l2 --> 12998.878 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:53:49,036][Main][INFO] - [train] Step 17800 out of 65536 | Loss --> 3.138 | Grad_l2 --> 0.389 | Weights_l2 --> 13019.866 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:54:22,355][Main][INFO] - [train] Step 17900 out of 65536 | Loss --> 3.124 | Grad_l2 --> 0.380 | Weights_l2 --> 13040.918 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:54:55,688][Main][INFO] - [train] Step 18000 out of 65536 | Loss --> 3.123 | Grad_l2 --> 0.376 | Weights_l2 --> 13061.731 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:55:29,034][Main][INFO] - [train] Step 18100 out of 65536 | Loss --> 3.115 | Grad_l2 --> 0.391 | Weights_l2 --> 13082.457 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:56:02,656][Main][INFO] - [train] Step 18200 out of 65536 | Loss --> 3.121 | Grad_l2 --> 0.396 | Weights_l2 --> 13103.418 | Lr --> 0.019 | Seconds_per_step --> 0.336 | [2024-07-27 08:56:35,967][Main][INFO] - [train] Step 18300 out of 65536 | Loss --> 3.125 | Grad_l2 --> 0.382 | Weights_l2 --> 13124.559 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:57:09,255][Main][INFO] - [train] Step 18400 out of 65536 | Loss --> 3.101 | Grad_l2 --> 0.396 | Weights_l2 --> 13145.684 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:57:42,567][Main][INFO] - [train] Step 18500 out of 65536 | Loss --> 3.106 | Grad_l2 --> 0.390 | Weights_l2 --> 13166.864 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:58:15,867][Main][INFO] - [train] Step 18600 out of 65536 | Loss --> 3.109 | Grad_l2 --> 0.390 | Weights_l2 --> 13187.974 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:58:49,190][Main][INFO] - [train] Step 18700 out of 65536 | Loss --> 3.103 | Grad_l2 --> 0.383 | Weights_l2 --> 13209.159 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 08:59:22,789][Main][INFO] - [train] Step 18800 out of 65536 | Loss --> 3.097 | Grad_l2 --> 0.387 | Weights_l2 --> 13230.137 | Lr --> 0.019 | Seconds_per_step --> 0.336 | [2024-07-27 08:59:56,122][Main][INFO] - [train] Step 18900 out of 65536 | Loss --> 3.096 | Grad_l2 --> 0.374 | Weights_l2 --> 13251.031 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 09:00:29,437][Main][INFO] - [train] Step 19000 out of 65536 | Loss --> 3.143 | Grad_l2 --> 0.507 | Weights_l2 --> 13274.056 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 09:01:02,756][Main][INFO] - [train] Step 19100 out of 65536 | Loss --> 3.102 | Grad_l2 --> 0.385 | Weights_l2 --> 13295.359 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 09:01:36,059][Main][INFO] - [train] Step 19200 out of 65536 | Loss --> 3.091 | Grad_l2 --> 0.384 | Weights_l2 --> 13315.947 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 09:02:09,406][Main][INFO] - [train] Step 19300 out of 65536 | Loss --> 3.088 | Grad_l2 --> 0.363 | Weights_l2 --> 13336.247 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 09:02:44,212][Main][INFO] - [train] Step 19400 out of 65536 | Loss --> 3.070 | Grad_l2 --> 0.393 | Weights_l2 --> 13356.639 | Lr --> 0.019 | Seconds_per_step --> 0.348 | [2024-07-27 09:03:18,653][Main][INFO] - [train] Step 19500 out of 65536 | Loss --> 3.070 | Grad_l2 --> 0.388 | Weights_l2 --> 13376.838 | Lr --> 0.019 | Seconds_per_step --> 0.344 | [2024-07-27 09:03:52,234][Main][INFO] - [train] Step 19600 out of 65536 | Loss --> 3.058 | Grad_l2 --> 0.376 | Weights_l2 --> 13397.226 | Lr --> 0.019 | Seconds_per_step --> 0.336 | [2024-07-27 09:04:25,516][Main][INFO] - [train] Step 19700 out of 65536 | Loss --> 3.071 | Grad_l2 --> 0.399 | Weights_l2 --> 13417.673 | Lr --> 0.019 | Seconds_per_step --> 0.333 | [2024-07-27 09:04:58,762][Main][INFO] - [train] Step 19800 out of 65536 | Loss --> 3.063 | Grad_l2 --> 0.399 | Weights_l2 --> 13438.034 | Lr --> 0.019 | Seconds_per_step --> 0.332 | [2024-07-27 09:05:32,010][Main][INFO] - [train] Step 19900 out of 65536 | Loss --> 3.053 | Grad_l2 --> 0.392 | Weights_l2 --> 13458.454 | Lr --> 0.018 | Seconds_per_step --> 0.332 | [2024-07-27 09:06:05,575][Main][INFO] - [train] Step 20000 out of 65536 | Loss --> 3.068 | Grad_l2 --> 0.395 | Weights_l2 --> 13478.586 | Lr --> 0.018 | Seconds_per_step --> 0.336 | [2024-07-27 09:06:05,575][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-20000 [2024-07-27 09:06:05,577][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 09:06:07,259][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-20000/model.safetensors [2024-07-27 09:06:09,089][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-20000/optimizer.bin [2024-07-27 09:06:09,090][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-20000/scheduler.bin [2024-07-27 09:06:09,090][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-20000/sampler.bin [2024-07-27 09:06:09,090][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-20000/sampler_1.bin [2024-07-27 09:06:09,090][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-20000/random_states_0.pkl [2024-07-27 09:06:42,384][Main][INFO] - [train] Step 20100 out of 65536 | Loss --> 3.060 | Grad_l2 --> 0.383 | Weights_l2 --> 13498.704 | Lr --> 0.018 | Seconds_per_step --> 0.368 | [2024-07-27 09:07:15,677][Main][INFO] - [train] Step 20200 out of 65536 | Loss --> 3.039 | Grad_l2 --> 0.380 | Weights_l2 --> 13518.555 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:07:48,974][Main][INFO] - [train] Step 20300 out of 65536 | Loss --> 3.049 | Grad_l2 --> 0.364 | Weights_l2 --> 13538.138 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:08:22,257][Main][INFO] - [train] Step 20400 out of 65536 | Loss --> 3.029 | Grad_l2 --> 0.405 | Weights_l2 --> 13557.751 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:08:55,551][Main][INFO] - [train] Step 20500 out of 65536 | Loss --> 3.035 | Grad_l2 --> 0.379 | Weights_l2 --> 13577.876 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:09:29,169][Main][INFO] - [train] Step 20600 out of 65536 | Loss --> 3.041 | Grad_l2 --> 0.381 | Weights_l2 --> 13597.657 | Lr --> 0.018 | Seconds_per_step --> 0.336 | [2024-07-27 09:10:02,467][Main][INFO] - [train] Step 20700 out of 65536 | Loss --> 3.045 | Grad_l2 --> 0.369 | Weights_l2 --> 13617.574 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:10:35,883][Main][INFO] - [train] Step 20800 out of 65536 | Loss --> 3.032 | Grad_l2 --> 0.393 | Weights_l2 --> 13637.822 | Lr --> 0.018 | Seconds_per_step --> 0.334 | [2024-07-27 09:11:09,311][Main][INFO] - [train] Step 20900 out of 65536 | Loss --> 3.024 | Grad_l2 --> 0.390 | Weights_l2 --> 13657.534 | Lr --> 0.018 | Seconds_per_step --> 0.334 | [2024-07-27 09:11:42,841][Main][INFO] - [train] Step 21000 out of 65536 | Loss --> 3.011 | Grad_l2 --> 0.405 | Weights_l2 --> 13677.109 | Lr --> 0.018 | Seconds_per_step --> 0.335 | [2024-07-27 09:12:16,450][Main][INFO] - [train] Step 21100 out of 65536 | Loss --> 3.013 | Grad_l2 --> 0.397 | Weights_l2 --> 13696.888 | Lr --> 0.018 | Seconds_per_step --> 0.336 | [2024-07-27 09:12:51,271][Main][INFO] - [train] Step 21200 out of 65536 | Loss --> 3.010 | Grad_l2 --> 0.381 | Weights_l2 --> 13716.615 | Lr --> 0.018 | Seconds_per_step --> 0.348 | [2024-07-27 09:13:24,794][Main][INFO] - [train] Step 21300 out of 65536 | Loss --> 3.014 | Grad_l2 --> 0.382 | Weights_l2 --> 13736.177 | Lr --> 0.018 | Seconds_per_step --> 0.335 | [2024-07-27 09:13:59,895][Main][INFO] - [train] Step 21400 out of 65536 | Loss --> 3.005 | Grad_l2 --> 0.401 | Weights_l2 --> 13755.703 | Lr --> 0.018 | Seconds_per_step --> 0.351 | [2024-07-27 09:14:33,255][Main][INFO] - [train] Step 21500 out of 65536 | Loss --> 3.013 | Grad_l2 --> 0.382 | Weights_l2 --> 13774.937 | Lr --> 0.018 | Seconds_per_step --> 0.334 | [2024-07-27 09:15:06,572][Main][INFO] - [train] Step 21600 out of 65536 | Loss --> 3.015 | Grad_l2 --> 0.380 | Weights_l2 --> 13794.110 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:15:39,878][Main][INFO] - [train] Step 21700 out of 65536 | Loss --> 3.002 | Grad_l2 --> 0.399 | Weights_l2 --> 13813.187 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:16:13,514][Main][INFO] - [train] Step 21800 out of 65536 | Loss --> 3.015 | Grad_l2 --> 0.389 | Weights_l2 --> 13832.329 | Lr --> 0.018 | Seconds_per_step --> 0.336 | [2024-07-27 09:16:46,832][Main][INFO] - [train] Step 21900 out of 65536 | Loss --> 2.996 | Grad_l2 --> 0.390 | Weights_l2 --> 13852.133 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:17:20,144][Main][INFO] - [train] Step 22000 out of 65536 | Loss --> 3.005 | Grad_l2 --> 0.392 | Weights_l2 --> 13871.386 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:17:53,453][Main][INFO] - [train] Step 22100 out of 65536 | Loss --> 2.997 | Grad_l2 --> 0.396 | Weights_l2 --> 13890.531 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:18:26,767][Main][INFO] - [train] Step 22200 out of 65536 | Loss --> 2.996 | Grad_l2 --> 0.399 | Weights_l2 --> 13910.095 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:19:00,085][Main][INFO] - [train] Step 22300 out of 65536 | Loss --> 2.999 | Grad_l2 --> 0.400 | Weights_l2 --> 13929.645 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:19:33,396][Main][INFO] - [train] Step 22400 out of 65536 | Loss --> 2.978 | Grad_l2 --> 0.395 | Weights_l2 --> 13948.438 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:20:06,995][Main][INFO] - [train] Step 22500 out of 65536 | Loss --> 2.990 | Grad_l2 --> 0.377 | Weights_l2 --> 13967.494 | Lr --> 0.018 | Seconds_per_step --> 0.336 | [2024-07-27 09:20:06,996][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-22500 [2024-07-27 09:20:06,998][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 09:20:08,679][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-22500/model.safetensors [2024-07-27 09:20:10,574][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-22500/optimizer.bin [2024-07-27 09:20:10,575][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-22500/scheduler.bin [2024-07-27 09:20:10,575][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-22500/sampler.bin [2024-07-27 09:20:10,575][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-22500/sampler_1.bin [2024-07-27 09:20:10,575][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-22500/random_states_0.pkl [2024-07-27 09:20:45,699][Main][INFO] - [train] Step 22600 out of 65536 | Loss --> 2.985 | Grad_l2 --> 0.394 | Weights_l2 --> 13986.354 | Lr --> 0.018 | Seconds_per_step --> 0.387 | [2024-07-27 09:21:19,018][Main][INFO] - [train] Step 22700 out of 65536 | Loss --> 2.962 | Grad_l2 --> 0.386 | Weights_l2 --> 14005.368 | Lr --> 0.018 | Seconds_per_step --> 0.333 | [2024-07-27 09:21:52,339][Main][INFO] - [train] Step 22800 out of 65536 | Loss --> 2.965 | Grad_l2 --> 0.390 | Weights_l2 --> 14024.336 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:22:25,672][Main][INFO] - [train] Step 22900 out of 65536 | Loss --> 2.972 | Grad_l2 --> 0.373 | Weights_l2 --> 14043.289 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:22:59,843][Main][INFO] - [train] Step 23000 out of 65536 | Loss --> 2.976 | Grad_l2 --> 0.395 | Weights_l2 --> 14062.288 | Lr --> 0.017 | Seconds_per_step --> 0.342 | [2024-07-27 09:23:33,445][Main][INFO] - [train] Step 23100 out of 65536 | Loss --> 2.961 | Grad_l2 --> 0.372 | Weights_l2 --> 14081.192 | Lr --> 0.017 | Seconds_per_step --> 0.336 | [2024-07-27 09:24:06,741][Main][INFO] - [train] Step 23200 out of 65536 | Loss --> 2.968 | Grad_l2 --> 0.394 | Weights_l2 --> 14099.935 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:24:40,030][Main][INFO] - [train] Step 23300 out of 65536 | Loss --> 2.959 | Grad_l2 --> 0.384 | Weights_l2 --> 14118.424 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:25:13,327][Main][INFO] - [train] Step 23400 out of 65536 | Loss --> 2.953 | Grad_l2 --> 0.388 | Weights_l2 --> 14136.944 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:25:46,625][Main][INFO] - [train] Step 23500 out of 65536 | Loss --> 2.965 | Grad_l2 --> 0.394 | Weights_l2 --> 14155.350 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:26:19,981][Main][INFO] - [train] Step 23600 out of 65536 | Loss --> 2.956 | Grad_l2 --> 0.395 | Weights_l2 --> 14173.634 | Lr --> 0.017 | Seconds_per_step --> 0.334 | [2024-07-27 09:26:55,551][Main][INFO] - [train] Step 23700 out of 65536 | Loss --> 2.951 | Grad_l2 --> 0.390 | Weights_l2 --> 14192.105 | Lr --> 0.017 | Seconds_per_step --> 0.356 | [2024-07-27 09:27:28,895][Main][INFO] - [train] Step 23800 out of 65536 | Loss --> 2.954 | Grad_l2 --> 0.414 | Weights_l2 --> 14210.466 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:28:02,171][Main][INFO] - [train] Step 23900 out of 65536 | Loss --> 2.941 | Grad_l2 --> 0.393 | Weights_l2 --> 14228.623 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:28:35,438][Main][INFO] - [train] Step 24000 out of 65536 | Loss --> 2.958 | Grad_l2 --> 0.379 | Weights_l2 --> 14247.057 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:29:08,757][Main][INFO] - [train] Step 24100 out of 65536 | Loss --> 2.949 | Grad_l2 --> 0.395 | Weights_l2 --> 14265.046 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:29:42,029][Main][INFO] - [train] Step 24200 out of 65536 | Loss --> 2.943 | Grad_l2 --> 0.381 | Weights_l2 --> 14283.241 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:30:15,618][Main][INFO] - [train] Step 24300 out of 65536 | Loss --> 2.943 | Grad_l2 --> 0.398 | Weights_l2 --> 14301.409 | Lr --> 0.017 | Seconds_per_step --> 0.336 | [2024-07-27 09:30:48,909][Main][INFO] - [train] Step 24400 out of 65536 | Loss --> 2.977 | Grad_l2 --> 0.550 | Weights_l2 --> 14321.035 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:31:22,200][Main][INFO] - [train] Step 24500 out of 65536 | Loss --> 2.944 | Grad_l2 --> 0.390 | Weights_l2 --> 14339.144 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:31:57,259][Main][INFO] - [train] Step 24600 out of 65536 | Loss --> 2.931 | Grad_l2 --> 0.403 | Weights_l2 --> 14357.359 | Lr --> 0.017 | Seconds_per_step --> 0.351 | [2024-07-27 09:32:31,691][Main][INFO] - [train] Step 24700 out of 65536 | Loss --> 2.938 | Grad_l2 --> 0.395 | Weights_l2 --> 14374.879 | Lr --> 0.017 | Seconds_per_step --> 0.344 | [2024-07-27 09:33:05,280][Main][INFO] - [train] Step 24800 out of 65536 | Loss --> 2.935 | Grad_l2 --> 0.378 | Weights_l2 --> 14392.869 | Lr --> 0.017 | Seconds_per_step --> 0.336 | [2024-07-27 09:33:38,873][Main][INFO] - [train] Step 24900 out of 65536 | Loss --> 2.933 | Grad_l2 --> 0.400 | Weights_l2 --> 14410.644 | Lr --> 0.017 | Seconds_per_step --> 0.336 | [2024-07-27 09:34:12,169][Main][INFO] - [train] Step 25000 out of 65536 | Loss --> 2.933 | Grad_l2 --> 0.401 | Weights_l2 --> 14428.140 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:34:12,169][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-25000 [2024-07-27 09:34:12,171][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 09:34:13,844][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-25000/model.safetensors [2024-07-27 09:34:15,729][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-25000/optimizer.bin [2024-07-27 09:34:15,730][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-25000/scheduler.bin [2024-07-27 09:34:15,730][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-25000/sampler.bin [2024-07-27 09:34:15,730][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-25000/sampler_1.bin [2024-07-27 09:34:15,730][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-25000/random_states_0.pkl [2024-07-27 09:34:49,061][Main][INFO] - [train] Step 25100 out of 65536 | Loss --> 2.913 | Grad_l2 --> 0.406 | Weights_l2 --> 14445.549 | Lr --> 0.017 | Seconds_per_step --> 0.369 | [2024-07-27 09:35:22,352][Main][INFO] - [train] Step 25200 out of 65536 | Loss --> 2.935 | Grad_l2 --> 0.396 | Weights_l2 --> 14463.018 | Lr --> 0.017 | Seconds_per_step --> 0.333 | [2024-07-27 09:35:55,640][Main][INFO] - [train] Step 25300 out of 65536 | Loss --> 2.922 | Grad_l2 --> 0.397 | Weights_l2 --> 14480.322 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:36:28,934][Main][INFO] - [train] Step 25400 out of 65536 | Loss --> 2.922 | Grad_l2 --> 0.394 | Weights_l2 --> 14497.791 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:37:02,548][Main][INFO] - [train] Step 25500 out of 65536 | Loss --> 2.907 | Grad_l2 --> 0.394 | Weights_l2 --> 14514.922 | Lr --> 0.016 | Seconds_per_step --> 0.336 | [2024-07-27 09:37:35,864][Main][INFO] - [train] Step 25600 out of 65536 | Loss --> 2.897 | Grad_l2 --> 0.395 | Weights_l2 --> 14531.877 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:38:09,179][Main][INFO] - [train] Step 25700 out of 65536 | Loss --> 2.915 | Grad_l2 --> 0.390 | Weights_l2 --> 14548.973 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:38:42,502][Main][INFO] - [train] Step 25800 out of 65536 | Loss --> 2.909 | Grad_l2 --> 0.395 | Weights_l2 --> 14566.039 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:39:15,842][Main][INFO] - [train] Step 25900 out of 65536 | Loss --> 2.902 | Grad_l2 --> 0.404 | Weights_l2 --> 14582.855 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:39:49,540][Main][INFO] - [train] Step 26000 out of 65536 | Loss --> 2.898 | Grad_l2 --> 0.390 | Weights_l2 --> 14599.689 | Lr --> 0.016 | Seconds_per_step --> 0.337 | [2024-07-27 09:40:23,379][Main][INFO] - [train] Step 26100 out of 65536 | Loss --> 2.903 | Grad_l2 --> 0.381 | Weights_l2 --> 14616.550 | Lr --> 0.016 | Seconds_per_step --> 0.338 | [2024-07-27 09:40:56,953][Main][INFO] - [train] Step 26200 out of 65536 | Loss --> 2.906 | Grad_l2 --> 0.388 | Weights_l2 --> 14633.084 | Lr --> 0.016 | Seconds_per_step --> 0.336 | [2024-07-27 09:41:30,360][Main][INFO] - [train] Step 26300 out of 65536 | Loss --> 2.885 | Grad_l2 --> 0.389 | Weights_l2 --> 14649.948 | Lr --> 0.016 | Seconds_per_step --> 0.334 | [2024-07-27 09:42:03,664][Main][INFO] - [train] Step 26400 out of 65536 | Loss --> 2.886 | Grad_l2 --> 0.401 | Weights_l2 --> 14666.622 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:42:37,214][Main][INFO] - [train] Step 26500 out of 65536 | Loss --> 2.890 | Grad_l2 --> 0.392 | Weights_l2 --> 14683.151 | Lr --> 0.016 | Seconds_per_step --> 0.335 | [2024-07-27 09:43:10,924][Main][INFO] - [train] Step 26600 out of 65536 | Loss --> 2.881 | Grad_l2 --> 0.396 | Weights_l2 --> 14699.749 | Lr --> 0.016 | Seconds_per_step --> 0.337 | [2024-07-27 09:43:44,520][Main][INFO] - [train] Step 26700 out of 65536 | Loss --> 2.873 | Grad_l2 --> 0.384 | Weights_l2 --> 14716.108 | Lr --> 0.016 | Seconds_per_step --> 0.336 | [2024-07-27 09:44:17,840][Main][INFO] - [train] Step 26800 out of 65536 | Loss --> 2.881 | Grad_l2 --> 0.389 | Weights_l2 --> 14732.534 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:44:51,120][Main][INFO] - [train] Step 26900 out of 65536 | Loss --> 2.873 | Grad_l2 --> 0.401 | Weights_l2 --> 14748.967 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:45:24,415][Main][INFO] - [train] Step 27000 out of 65536 | Loss --> 2.879 | Grad_l2 --> 0.402 | Weights_l2 --> 14765.211 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:45:58,128][Main][INFO] - [train] Step 27100 out of 65536 | Loss --> 2.887 | Grad_l2 --> 0.414 | Weights_l2 --> 14781.947 | Lr --> 0.016 | Seconds_per_step --> 0.337 | [2024-07-27 09:46:31,416][Main][INFO] - [train] Step 27200 out of 65536 | Loss --> 2.880 | Grad_l2 --> 0.405 | Weights_l2 --> 14798.081 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:47:04,999][Main][INFO] - [train] Step 27300 out of 65536 | Loss --> 2.907 | Grad_l2 --> 0.527 | Weights_l2 --> 14816.017 | Lr --> 0.016 | Seconds_per_step --> 0.336 | [2024-07-27 09:47:38,262][Main][INFO] - [train] Step 27400 out of 65536 | Loss --> 2.877 | Grad_l2 --> 0.403 | Weights_l2 --> 14832.256 | Lr --> 0.016 | Seconds_per_step --> 0.333 | [2024-07-27 09:48:11,524][Main][INFO] - [train] Step 27500 out of 65536 | Loss --> 2.870 | Grad_l2 --> 0.396 | Weights_l2 --> 14847.964 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:48:11,524][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-27500 [2024-07-27 09:48:11,526][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 09:48:13,205][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-27500/model.safetensors [2024-07-27 09:48:15,037][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-27500/optimizer.bin [2024-07-27 09:48:15,037][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-27500/scheduler.bin [2024-07-27 09:48:15,037][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-27500/sampler.bin [2024-07-27 09:48:15,037][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-27500/sampler_1.bin [2024-07-27 09:48:15,038][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-27500/random_states_0.pkl [2024-07-27 09:48:48,385][Main][INFO] - [train] Step 27600 out of 65536 | Loss --> 2.876 | Grad_l2 --> 0.396 | Weights_l2 --> 14863.628 | Lr --> 0.015 | Seconds_per_step --> 0.369 | [2024-07-27 09:49:21,681][Main][INFO] - [train] Step 27700 out of 65536 | Loss --> 2.850 | Grad_l2 --> 0.391 | Weights_l2 --> 14879.187 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:49:55,757][Main][INFO] - [train] Step 27800 out of 65536 | Loss --> 2.856 | Grad_l2 --> 0.410 | Weights_l2 --> 14894.746 | Lr --> 0.015 | Seconds_per_step --> 0.341 | [2024-07-27 09:50:30,593][Main][INFO] - [train] Step 27900 out of 65536 | Loss --> 2.857 | Grad_l2 --> 0.392 | Weights_l2 --> 14910.309 | Lr --> 0.015 | Seconds_per_step --> 0.348 | [2024-07-27 09:51:03,927][Main][INFO] - [train] Step 28000 out of 65536 | Loss --> 2.854 | Grad_l2 --> 0.406 | Weights_l2 --> 14925.710 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:51:37,245][Main][INFO] - [train] Step 28100 out of 65536 | Loss --> 2.855 | Grad_l2 --> 0.409 | Weights_l2 --> 14941.094 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:52:10,583][Main][INFO] - [train] Step 28200 out of 65536 | Loss --> 2.845 | Grad_l2 --> 0.388 | Weights_l2 --> 14956.044 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:52:43,899][Main][INFO] - [train] Step 28300 out of 65536 | Loss --> 2.848 | Grad_l2 --> 0.385 | Weights_l2 --> 14971.372 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:53:17,203][Main][INFO] - [train] Step 28400 out of 65536 | Loss --> 2.851 | Grad_l2 --> 0.390 | Weights_l2 --> 14986.724 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:53:50,539][Main][INFO] - [train] Step 28500 out of 65536 | Loss --> 2.834 | Grad_l2 --> 0.397 | Weights_l2 --> 15001.668 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:54:24,158][Main][INFO] - [train] Step 28600 out of 65536 | Loss --> 2.845 | Grad_l2 --> 0.396 | Weights_l2 --> 15016.717 | Lr --> 0.015 | Seconds_per_step --> 0.336 | [2024-07-27 09:54:57,487][Main][INFO] - [train] Step 28700 out of 65536 | Loss --> 2.828 | Grad_l2 --> 0.387 | Weights_l2 --> 15031.430 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:55:30,810][Main][INFO] - [train] Step 28800 out of 65536 | Loss --> 2.838 | Grad_l2 --> 0.417 | Weights_l2 --> 15046.210 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:56:04,122][Main][INFO] - [train] Step 28900 out of 65536 | Loss --> 2.824 | Grad_l2 --> 0.397 | Weights_l2 --> 15060.994 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:56:37,447][Main][INFO] - [train] Step 29000 out of 65536 | Loss --> 2.846 | Grad_l2 --> 0.405 | Weights_l2 --> 15075.823 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:57:10,830][Main][INFO] - [train] Step 29100 out of 65536 | Loss --> 2.832 | Grad_l2 --> 0.410 | Weights_l2 --> 15090.454 | Lr --> 0.015 | Seconds_per_step --> 0.334 | [2024-07-27 09:57:44,416][Main][INFO] - [train] Step 29200 out of 65536 | Loss --> 2.835 | Grad_l2 --> 0.398 | Weights_l2 --> 15105.294 | Lr --> 0.015 | Seconds_per_step --> 0.336 | [2024-07-27 09:58:17,928][Main][INFO] - [train] Step 29300 out of 65536 | Loss --> 2.839 | Grad_l2 --> 0.412 | Weights_l2 --> 15119.837 | Lr --> 0.015 | Seconds_per_step --> 0.335 | [2024-07-27 09:58:51,263][Main][INFO] - [train] Step 29400 out of 65536 | Loss --> 2.815 | Grad_l2 --> 0.399 | Weights_l2 --> 15134.175 | Lr --> 0.015 | Seconds_per_step --> 0.333 | [2024-07-27 09:59:24,874][Main][INFO] - [train] Step 29500 out of 65536 | Loss --> 2.818 | Grad_l2 --> 0.414 | Weights_l2 --> 15148.462 | Lr --> 0.015 | Seconds_per_step --> 0.336 | [2024-07-27 09:59:58,214][Main][INFO] - [train] Step 29600 out of 65536 | Loss --> 2.820 | Grad_l2 --> 0.412 | Weights_l2 --> 15163.219 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:00:31,515][Main][INFO] - [train] Step 29700 out of 65536 | Loss --> 2.830 | Grad_l2 --> 0.396 | Weights_l2 --> 15177.416 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:01:05,171][Main][INFO] - [train] Step 29800 out of 65536 | Loss --> 2.811 | Grad_l2 --> 0.410 | Weights_l2 --> 15191.412 | Lr --> 0.014 | Seconds_per_step --> 0.337 | [2024-07-27 10:01:38,495][Main][INFO] - [train] Step 29900 out of 65536 | Loss --> 2.816 | Grad_l2 --> 0.410 | Weights_l2 --> 15205.393 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:02:12,660][Main][INFO] - [train] Step 30000 out of 65536 | Loss --> 2.811 | Grad_l2 --> 0.409 | Weights_l2 --> 15219.376 | Lr --> 0.014 | Seconds_per_step --> 0.342 | [2024-07-27 10:02:12,660][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-30000 [2024-07-27 10:02:12,662][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 10:02:14,307][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-30000/model.safetensors [2024-07-27 10:02:16,182][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-30000/optimizer.bin [2024-07-27 10:02:16,182][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-30000/scheduler.bin [2024-07-27 10:02:16,182][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-30000/sampler.bin [2024-07-27 10:02:16,182][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-30000/sampler_1.bin [2024-07-27 10:02:16,183][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-30000/random_states_0.pkl [2024-07-27 10:02:49,548][Main][INFO] - [train] Step 30100 out of 65536 | Loss --> 2.818 | Grad_l2 --> 0.400 | Weights_l2 --> 15233.254 | Lr --> 0.014 | Seconds_per_step --> 0.369 | [2024-07-27 10:03:22,877][Main][INFO] - [train] Step 30200 out of 65536 | Loss --> 2.801 | Grad_l2 --> 0.415 | Weights_l2 --> 15247.017 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:03:56,187][Main][INFO] - [train] Step 30300 out of 65536 | Loss --> 2.814 | Grad_l2 --> 0.409 | Weights_l2 --> 15260.925 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:04:29,813][Main][INFO] - [train] Step 30400 out of 65536 | Loss --> 2.796 | Grad_l2 --> 0.406 | Weights_l2 --> 15274.580 | Lr --> 0.014 | Seconds_per_step --> 0.336 | [2024-07-27 10:05:03,138][Main][INFO] - [train] Step 30500 out of 65536 | Loss --> 2.808 | Grad_l2 --> 0.407 | Weights_l2 --> 15288.110 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:05:36,457][Main][INFO] - [train] Step 30600 out of 65536 | Loss --> 2.794 | Grad_l2 --> 0.418 | Weights_l2 --> 15301.567 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:06:10,596][Main][INFO] - [train] Step 30700 out of 65536 | Loss --> 2.799 | Grad_l2 --> 0.407 | Weights_l2 --> 15315.102 | Lr --> 0.014 | Seconds_per_step --> 0.341 | [2024-07-27 10:06:44,922][Main][INFO] - [train] Step 30800 out of 65536 | Loss --> 2.802 | Grad_l2 --> 0.399 | Weights_l2 --> 15328.468 | Lr --> 0.014 | Seconds_per_step --> 0.343 | [2024-07-27 10:07:19,673][Main][INFO] - [train] Step 30900 out of 65536 | Loss --> 2.787 | Grad_l2 --> 0.404 | Weights_l2 --> 15341.893 | Lr --> 0.014 | Seconds_per_step --> 0.348 | [2024-07-27 10:07:53,251][Main][INFO] - [train] Step 31000 out of 65536 | Loss --> 2.783 | Grad_l2 --> 0.405 | Weights_l2 --> 15354.991 | Lr --> 0.014 | Seconds_per_step --> 0.336 | [2024-07-27 10:08:26,522][Main][INFO] - [train] Step 31100 out of 65536 | Loss --> 2.788 | Grad_l2 --> 0.438 | Weights_l2 --> 15368.169 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:08:59,809][Main][INFO] - [train] Step 31200 out of 65536 | Loss --> 2.795 | Grad_l2 --> 0.400 | Weights_l2 --> 15381.035 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:09:33,091][Main][INFO] - [train] Step 31300 out of 65536 | Loss --> 2.783 | Grad_l2 --> 0.407 | Weights_l2 --> 15394.148 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:10:06,392][Main][INFO] - [train] Step 31400 out of 65536 | Loss --> 2.785 | Grad_l2 --> 0.395 | Weights_l2 --> 15406.643 | Lr --> 0.014 | Seconds_per_step --> 0.333 | [2024-07-27 10:10:39,693][Main][INFO] - [train] Step 31500 out of 65536 | Loss --> 2.783 | Grad_l2 --> 0.428 | Weights_l2 --> 15419.429 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 10:11:13,322][Main][INFO] - [train] Step 31600 out of 65536 | Loss --> 2.772 | Grad_l2 --> 0.417 | Weights_l2 --> 15432.022 | Lr --> 0.013 | Seconds_per_step --> 0.336 | [2024-07-27 10:11:46,623][Main][INFO] - [train] Step 31700 out of 65536 | Loss --> 2.779 | Grad_l2 --> 0.425 | Weights_l2 --> 15444.564 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 10:12:19,962][Main][INFO] - [train] Step 31800 out of 65536 | Loss --> 2.746 | Grad_l2 --> 0.411 | Weights_l2 --> 15457.094 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 10:12:53,326][Main][INFO] - [train] Step 31900 out of 65536 | Loss --> 2.770 | Grad_l2 --> 0.404 | Weights_l2 --> 15469.413 | Lr --> 0.013 | Seconds_per_step --> 0.334 | [2024-07-27 10:13:26,898][Main][INFO] - [train] Step 32000 out of 65536 | Loss --> 2.768 | Grad_l2 --> 0.411 | Weights_l2 --> 15481.668 | Lr --> 0.013 | Seconds_per_step --> 0.336 | [2024-07-27 10:14:00,155][Main][INFO] - [train] Step 32100 out of 65536 | Loss --> 2.758 | Grad_l2 --> 0.405 | Weights_l2 --> 15493.921 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 10:14:34,245][Main][INFO] - [train] Step 32200 out of 65536 | Loss --> 2.769 | Grad_l2 --> 0.391 | Weights_l2 --> 15505.930 | Lr --> 0.013 | Seconds_per_step --> 0.341 | [2024-07-27 10:15:07,596][Main][INFO] - [train] Step 32300 out of 65536 | Loss --> 2.761 | Grad_l2 --> 0.413 | Weights_l2 --> 15517.848 | Lr --> 0.013 | Seconds_per_step --> 0.334 | [2024-07-27 10:15:40,882][Main][INFO] - [train] Step 32400 out of 65536 | Loss --> 2.758 | Grad_l2 --> 0.403 | Weights_l2 --> 15529.776 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 10:16:14,178][Main][INFO] - [train] Step 32500 out of 65536 | Loss --> 2.756 | Grad_l2 --> 0.412 | Weights_l2 --> 15541.412 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 10:16:14,178][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-32500 [2024-07-27 10:16:14,180][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 10:16:15,852][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-32500/model.safetensors [2024-07-27 10:16:17,744][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-32500/optimizer.bin [2024-07-27 10:16:17,744][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-32500/scheduler.bin [2024-07-27 10:16:17,744][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-32500/sampler.bin [2024-07-27 10:16:17,744][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-32500/sampler_1.bin [2024-07-27 10:16:17,745][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-32500/random_states_0.pkl [2024-07-27 10:16:51,084][Main][INFO] - [train] Step 32600 out of 65536 | Loss --> 2.785 | Grad_l2 --> 0.395 | Weights_l2 --> 15553.184 | Lr --> 0.013 | Seconds_per_step --> 0.369 | [2024-07-27 10:17:24,365][Main][INFO] - [train] Step 32700 out of 65536 | Loss --> 2.753 | Grad_l2 --> 0.411 | Weights_l2 --> 15564.797 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 10:17:57,989][Main][INFO] - [train] Step 32800 out of 65536 | Loss --> 2.742 | Grad_l2 --> 0.427 | Weights_l2 --> 15576.336 | Lr --> 0.013 | Seconds_per_step --> 0.336 | [2024-07-27 10:18:31,722][Main][INFO] - [train] Step 32900 out of 65536 | Loss --> 2.749 | Grad_l2 --> 0.404 | Weights_l2 --> 15587.814 | Lr --> 0.013 | Seconds_per_step --> 0.337 | [2024-07-27 10:19:05,079][Main][INFO] - [train] Step 33000 out of 65536 | Loss --> 2.742 | Grad_l2 --> 0.393 | Weights_l2 --> 15599.178 | Lr --> 0.013 | Seconds_per_step --> 0.334 | [2024-07-27 10:19:38,393][Main][INFO] - [train] Step 33100 out of 65536 | Loss --> 2.753 | Grad_l2 --> 0.417 | Weights_l2 --> 15610.611 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 10:20:12,158][Main][INFO] - [train] Step 33200 out of 65536 | Loss --> 2.735 | Grad_l2 --> 0.422 | Weights_l2 --> 15621.841 | Lr --> 0.013 | Seconds_per_step --> 0.338 | [2024-07-27 10:20:45,475][Main][INFO] - [train] Step 33300 out of 65536 | Loss --> 2.741 | Grad_l2 --> 0.404 | Weights_l2 --> 15633.030 | Lr --> 0.013 | Seconds_per_step --> 0.333 | [2024-07-27 10:21:19,125][Main][INFO] - [train] Step 33400 out of 65536 | Loss --> 2.739 | Grad_l2 --> 0.417 | Weights_l2 --> 15644.569 | Lr --> 0.012 | Seconds_per_step --> 0.336 | [2024-07-27 10:21:52,420][Main][INFO] - [train] Step 33500 out of 65536 | Loss --> 2.740 | Grad_l2 --> 0.417 | Weights_l2 --> 15655.413 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 10:22:25,725][Main][INFO] - [train] Step 33600 out of 65536 | Loss --> 2.759 | Grad_l2 --> 0.426 | Weights_l2 --> 15666.501 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 10:22:59,305][Main][INFO] - [train] Step 33700 out of 65536 | Loss --> 2.730 | Grad_l2 --> 0.408 | Weights_l2 --> 15677.229 | Lr --> 0.012 | Seconds_per_step --> 0.336 | [2024-07-27 10:23:32,594][Main][INFO] - [train] Step 33800 out of 65536 | Loss --> 2.742 | Grad_l2 --> 0.416 | Weights_l2 --> 15687.937 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 10:24:06,312][Main][INFO] - [train] Step 33900 out of 65536 | Loss --> 2.735 | Grad_l2 --> 0.410 | Weights_l2 --> 15698.623 | Lr --> 0.012 | Seconds_per_step --> 0.337 | [2024-07-27 10:24:39,996][Main][INFO] - [train] Step 34000 out of 65536 | Loss --> 2.735 | Grad_l2 --> 0.404 | Weights_l2 --> 15709.017 | Lr --> 0.012 | Seconds_per_step --> 0.337 | [2024-07-27 10:25:13,610][Main][INFO] - [train] Step 34100 out of 65536 | Loss --> 2.736 | Grad_l2 --> 0.419 | Weights_l2 --> 15719.557 | Lr --> 0.012 | Seconds_per_step --> 0.336 | [2024-07-27 10:25:46,938][Main][INFO] - [train] Step 34200 out of 65536 | Loss --> 2.732 | Grad_l2 --> 0.430 | Weights_l2 --> 15729.913 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 10:26:21,490][Main][INFO] - [train] Step 34300 out of 65536 | Loss --> 2.732 | Grad_l2 --> 0.405 | Weights_l2 --> 15740.086 | Lr --> 0.012 | Seconds_per_step --> 0.346 | [2024-07-27 10:26:54,872][Main][INFO] - [train] Step 34400 out of 65536 | Loss --> 2.728 | Grad_l2 --> 0.419 | Weights_l2 --> 15750.072 | Lr --> 0.012 | Seconds_per_step --> 0.334 | [2024-07-27 10:27:28,167][Main][INFO] - [train] Step 34500 out of 65536 | Loss --> 2.719 | Grad_l2 --> 0.404 | Weights_l2 --> 15760.261 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 10:28:01,476][Main][INFO] - [train] Step 34600 out of 65536 | Loss --> 2.709 | Grad_l2 --> 0.415 | Weights_l2 --> 15770.154 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 10:28:35,118][Main][INFO] - [train] Step 34700 out of 65536 | Loss --> 2.722 | Grad_l2 --> 0.420 | Weights_l2 --> 15780.242 | Lr --> 0.012 | Seconds_per_step --> 0.336 | [2024-07-27 10:29:08,433][Main][INFO] - [train] Step 34800 out of 65536 | Loss --> 2.710 | Grad_l2 --> 0.420 | Weights_l2 --> 15790.149 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 10:29:41,770][Main][INFO] - [train] Step 34900 out of 65536 | Loss --> 2.709 | Grad_l2 --> 0.413 | Weights_l2 --> 15799.912 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 10:30:15,113][Main][INFO] - [train] Step 35000 out of 65536 | Loss --> 2.728 | Grad_l2 --> 0.416 | Weights_l2 --> 15809.457 | Lr --> 0.012 | Seconds_per_step --> 0.333 | [2024-07-27 10:30:15,114][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-35000 [2024-07-27 10:30:15,116][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 10:30:16,764][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-35000/model.safetensors [2024-07-27 10:30:18,691][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-35000/optimizer.bin [2024-07-27 10:30:18,691][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-35000/scheduler.bin [2024-07-27 10:30:18,692][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-35000/sampler.bin [2024-07-27 10:30:18,692][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-35000/sampler_1.bin [2024-07-27 10:30:18,692][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-35000/random_states_0.pkl [2024-07-27 10:30:52,126][Main][INFO] - [train] Step 35100 out of 65536 | Loss --> 2.706 | Grad_l2 --> 0.410 | Weights_l2 --> 15819.002 | Lr --> 0.012 | Seconds_per_step --> 0.370 | [2024-07-27 10:31:26,958][Main][INFO] - [train] Step 35200 out of 65536 | Loss --> 2.712 | Grad_l2 --> 0.419 | Weights_l2 --> 15828.451 | Lr --> 0.011 | Seconds_per_step --> 0.348 | [2024-07-27 10:32:01,139][Main][INFO] - [train] Step 35300 out of 65536 | Loss --> 2.720 | Grad_l2 --> 0.415 | Weights_l2 --> 15837.754 | Lr --> 0.011 | Seconds_per_step --> 0.342 | [2024-07-27 10:32:34,649][Main][INFO] - [train] Step 35400 out of 65536 | Loss --> 2.700 | Grad_l2 --> 0.422 | Weights_l2 --> 15847.090 | Lr --> 0.011 | Seconds_per_step --> 0.335 | [2024-07-27 10:33:08,180][Main][INFO] - [train] Step 35500 out of 65536 | Loss --> 2.714 | Grad_l2 --> 0.418 | Weights_l2 --> 15856.407 | Lr --> 0.011 | Seconds_per_step --> 0.335 | [2024-07-27 10:33:41,670][Main][INFO] - [train] Step 35600 out of 65536 | Loss --> 2.696 | Grad_l2 --> 0.439 | Weights_l2 --> 15865.720 | Lr --> 0.011 | Seconds_per_step --> 0.335 | [2024-07-27 10:34:14,974][Main][INFO] - [train] Step 35700 out of 65536 | Loss --> 2.698 | Grad_l2 --> 0.414 | Weights_l2 --> 15874.810 | Lr --> 0.011 | Seconds_per_step --> 0.333 | [2024-07-27 10:34:48,433][Main][INFO] - [train] Step 35800 out of 65536 | Loss --> 2.703 | Grad_l2 --> 0.414 | Weights_l2 --> 15883.800 | Lr --> 0.011 | Seconds_per_step --> 0.335 | [2024-07-27 10:35:22,272][Main][INFO] - [train] Step 35900 out of 65536 | Loss --> 2.691 | Grad_l2 --> 0.408 | Weights_l2 --> 15892.722 | Lr --> 0.011 | Seconds_per_step --> 0.338 | [2024-07-27 10:35:55,780][Main][INFO] - [train] Step 36000 out of 65536 | Loss --> 2.680 | Grad_l2 --> 0.406 | Weights_l2 --> 15901.550 | Lr --> 0.011 | Seconds_per_step --> 0.335 | [2024-07-27 10:36:29,219][Main][INFO] - [train] Step 36100 out of 65536 | Loss --> 2.681 | Grad_l2 --> 0.416 | Weights_l2 --> 15910.211 | Lr --> 0.011 | Seconds_per_step --> 0.334 | [2024-07-27 10:37:02,458][Main][INFO] - [train] Step 36200 out of 65536 | Loss --> 2.694 | Grad_l2 --> 0.491 | Weights_l2 --> 15919.485 | Lr --> 0.011 | Seconds_per_step --> 0.332 | [2024-07-27 10:37:35,876][Main][INFO] - [train] Step 36300 out of 65536 | Loss --> 2.688 | Grad_l2 --> 0.419 | Weights_l2 --> 15928.177 | Lr --> 0.011 | Seconds_per_step --> 0.334 | [2024-07-27 10:38:09,322][Main][INFO] - [train] Step 36400 out of 65536 | Loss --> 2.679 | Grad_l2 --> 0.420 | Weights_l2 --> 15936.515 | Lr --> 0.011 | Seconds_per_step --> 0.334 | [2024-07-27 10:38:42,930][Main][INFO] - [train] Step 36500 out of 65536 | Loss --> 2.690 | Grad_l2 --> 0.414 | Weights_l2 --> 15944.895 | Lr --> 0.011 | Seconds_per_step --> 0.336 | [2024-07-27 10:39:16,531][Main][INFO] - [train] Step 36600 out of 65536 | Loss --> 2.679 | Grad_l2 --> 0.419 | Weights_l2 --> 15953.168 | Lr --> 0.011 | Seconds_per_step --> 0.336 | [2024-07-27 10:39:49,965][Main][INFO] - [train] Step 36700 out of 65536 | Loss --> 2.677 | Grad_l2 --> 0.424 | Weights_l2 --> 15961.349 | Lr --> 0.011 | Seconds_per_step --> 0.334 | [2024-07-27 10:40:23,459][Main][INFO] - [train] Step 36800 out of 65536 | Loss --> 2.668 | Grad_l2 --> 0.420 | Weights_l2 --> 15969.461 | Lr --> 0.011 | Seconds_per_step --> 0.335 | [2024-07-27 10:40:56,866][Main][INFO] - [train] Step 36900 out of 65536 | Loss --> 2.671 | Grad_l2 --> 0.422 | Weights_l2 --> 15977.353 | Lr --> 0.010 | Seconds_per_step --> 0.334 | [2024-07-27 10:41:30,129][Main][INFO] - [train] Step 37000 out of 65536 | Loss --> 2.675 | Grad_l2 --> 0.418 | Weights_l2 --> 15985.271 | Lr --> 0.010 | Seconds_per_step --> 0.333 | [2024-07-27 10:42:03,723][Main][INFO] - [train] Step 37100 out of 65536 | Loss --> 2.673 | Grad_l2 --> 0.411 | Weights_l2 --> 15993.054 | Lr --> 0.010 | Seconds_per_step --> 0.336 | [2024-07-27 10:42:37,275][Main][INFO] - [train] Step 37200 out of 65536 | Loss --> 2.669 | Grad_l2 --> 0.410 | Weights_l2 --> 16000.813 | Lr --> 0.010 | Seconds_per_step --> 0.336 | [2024-07-27 10:43:10,644][Main][INFO] - [train] Step 37300 out of 65536 | Loss --> 2.666 | Grad_l2 --> 0.426 | Weights_l2 --> 16008.523 | Lr --> 0.010 | Seconds_per_step --> 0.334 | [2024-07-27 10:43:43,867][Main][INFO] - [train] Step 37400 out of 65536 | Loss --> 2.666 | Grad_l2 --> 0.415 | Weights_l2 --> 16016.119 | Lr --> 0.010 | Seconds_per_step --> 0.332 | [2024-07-27 10:44:17,166][Main][INFO] - [train] Step 37500 out of 65536 | Loss --> 2.672 | Grad_l2 --> 0.410 | Weights_l2 --> 16023.545 | Lr --> 0.010 | Seconds_per_step --> 0.333 | [2024-07-27 10:44:17,167][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-37500 [2024-07-27 10:44:17,169][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 10:44:18,797][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-37500/model.safetensors [2024-07-27 10:44:20,675][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-37500/optimizer.bin [2024-07-27 10:44:20,676][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-37500/scheduler.bin [2024-07-27 10:44:20,676][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-37500/sampler.bin [2024-07-27 10:44:20,676][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-37500/sampler_1.bin [2024-07-27 10:44:20,676][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-37500/random_states_0.pkl [2024-07-27 10:44:54,019][Main][INFO] - [train] Step 37600 out of 65536 | Loss --> 2.666 | Grad_l2 --> 0.413 | Weights_l2 --> 16031.029 | Lr --> 0.010 | Seconds_per_step --> 0.369 | [2024-07-27 10:45:27,807][Main][INFO] - [train] Step 37700 out of 65536 | Loss --> 2.669 | Grad_l2 --> 0.427 | Weights_l2 --> 16038.461 | Lr --> 0.010 | Seconds_per_step --> 0.338 | [2024-07-27 10:46:01,977][Main][INFO] - [train] Step 37800 out of 65536 | Loss --> 2.663 | Grad_l2 --> 0.418 | Weights_l2 --> 16045.852 | Lr --> 0.010 | Seconds_per_step --> 0.342 | [2024-07-27 10:46:36,191][Main][INFO] - [train] Step 37900 out of 65536 | Loss --> 2.661 | Grad_l2 --> 0.413 | Weights_l2 --> 16053.085 | Lr --> 0.010 | Seconds_per_step --> 0.342 | [2024-07-27 10:47:09,459][Main][INFO] - [train] Step 38000 out of 65536 | Loss --> 2.667 | Grad_l2 --> 0.426 | Weights_l2 --> 16060.238 | Lr --> 0.010 | Seconds_per_step --> 0.333 | [2024-07-27 10:47:42,754][Main][INFO] - [train] Step 38100 out of 65536 | Loss --> 2.658 | Grad_l2 --> 0.407 | Weights_l2 --> 16067.115 | Lr --> 0.010 | Seconds_per_step --> 0.333 | [2024-07-27 10:48:16,052][Main][INFO] - [train] Step 38200 out of 65536 | Loss --> 2.646 | Grad_l2 --> 0.415 | Weights_l2 --> 16074.097 | Lr --> 0.010 | Seconds_per_step --> 0.333 | [2024-07-27 10:48:49,638][Main][INFO] - [train] Step 38300 out of 65536 | Loss --> 2.647 | Grad_l2 --> 0.430 | Weights_l2 --> 16081.029 | Lr --> 0.010 | Seconds_per_step --> 0.336 | [2024-07-27 10:49:22,906][Main][INFO] - [train] Step 38400 out of 65536 | Loss --> 2.651 | Grad_l2 --> 0.420 | Weights_l2 --> 16087.730 | Lr --> 0.010 | Seconds_per_step --> 0.333 | [2024-07-27 10:49:56,184][Main][INFO] - [train] Step 38500 out of 65536 | Loss --> 2.647 | Grad_l2 --> 0.411 | Weights_l2 --> 16094.558 | Lr --> 0.010 | Seconds_per_step --> 0.333 | [2024-07-27 10:50:29,472][Main][INFO] - [train] Step 38600 out of 65536 | Loss --> 2.649 | Grad_l2 --> 0.419 | Weights_l2 --> 16101.279 | Lr --> 0.010 | Seconds_per_step --> 0.333 | [2024-07-27 10:51:03,331][Main][INFO] - [train] Step 38700 out of 65536 | Loss --> 2.653 | Grad_l2 --> 0.422 | Weights_l2 --> 16107.846 | Lr --> 0.009 | Seconds_per_step --> 0.339 | [2024-07-27 10:51:37,264][Main][INFO] - [train] Step 38800 out of 65536 | Loss --> 2.644 | Grad_l2 --> 0.414 | Weights_l2 --> 16114.350 | Lr --> 0.009 | Seconds_per_step --> 0.339 | [2024-07-27 10:52:10,843][Main][INFO] - [train] Step 38900 out of 65536 | Loss --> 2.642 | Grad_l2 --> 0.414 | Weights_l2 --> 16120.792 | Lr --> 0.009 | Seconds_per_step --> 0.336 | [2024-07-27 10:52:44,125][Main][INFO] - [train] Step 39000 out of 65536 | Loss --> 2.644 | Grad_l2 --> 0.419 | Weights_l2 --> 16127.154 | Lr --> 0.009 | Seconds_per_step --> 0.333 | [2024-07-27 10:53:17,396][Main][INFO] - [train] Step 39100 out of 65536 | Loss --> 2.649 | Grad_l2 --> 0.416 | Weights_l2 --> 16133.505 | Lr --> 0.009 | Seconds_per_step --> 0.333 | [2024-07-27 10:53:50,692][Main][INFO] - [train] Step 39200 out of 65536 | Loss --> 2.641 | Grad_l2 --> 0.416 | Weights_l2 --> 16139.658 | Lr --> 0.009 | Seconds_per_step --> 0.333 | [2024-07-27 10:54:23,957][Main][INFO] - [train] Step 39300 out of 65536 | Loss --> 2.638 | Grad_l2 --> 0.422 | Weights_l2 --> 16145.695 | Lr --> 0.009 | Seconds_per_step --> 0.333 | [2024-07-27 10:54:58,425][Main][INFO] - [train] Step 39400 out of 65536 | Loss --> 2.633 | Grad_l2 --> 0.420 | Weights_l2 --> 16151.798 | Lr --> 0.009 | Seconds_per_step --> 0.345 | [2024-07-27 10:55:32,631][Main][INFO] - [train] Step 39500 out of 65536 | Loss --> 2.626 | Grad_l2 --> 0.413 | Weights_l2 --> 16157.777 | Lr --> 0.009 | Seconds_per_step --> 0.342 | [2024-07-27 10:56:08,043][Main][INFO] - [train] Step 39600 out of 65536 | Loss --> 2.638 | Grad_l2 --> 0.420 | Weights_l2 --> 16163.723 | Lr --> 0.009 | Seconds_per_step --> 0.354 | [2024-07-27 10:56:42,237][Main][INFO] - [train] Step 39700 out of 65536 | Loss --> 2.617 | Grad_l2 --> 0.423 | Weights_l2 --> 16169.626 | Lr --> 0.009 | Seconds_per_step --> 0.342 | [2024-07-27 10:57:16,866][Main][INFO] - [train] Step 39800 out of 65536 | Loss --> 2.625 | Grad_l2 --> 0.414 | Weights_l2 --> 16175.306 | Lr --> 0.009 | Seconds_per_step --> 0.346 | [2024-07-27 10:57:50,148][Main][INFO] - [train] Step 39900 out of 65536 | Loss --> 2.618 | Grad_l2 --> 0.413 | Weights_l2 --> 16180.978 | Lr --> 0.009 | Seconds_per_step --> 0.333 | [2024-07-27 10:58:23,459][Main][INFO] - [train] Step 40000 out of 65536 | Loss --> 2.614 | Grad_l2 --> 0.418 | Weights_l2 --> 16186.567 | Lr --> 0.009 | Seconds_per_step --> 0.333 | [2024-07-27 10:58:23,459][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-40000 [2024-07-27 10:58:23,461][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 10:58:25,133][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-40000/model.safetensors [2024-07-27 10:58:27,006][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-40000/optimizer.bin [2024-07-27 10:58:27,006][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-40000/scheduler.bin [2024-07-27 10:58:27,006][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-40000/sampler.bin [2024-07-27 10:58:27,006][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-40000/sampler_1.bin [2024-07-27 10:58:27,007][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-40000/random_states_0.pkl [2024-07-27 10:59:00,332][Main][INFO] - [train] Step 40100 out of 65536 | Loss --> 2.609 | Grad_l2 --> 0.417 | Weights_l2 --> 16192.002 | Lr --> 0.009 | Seconds_per_step --> 0.369 | [2024-07-27 10:59:33,940][Main][INFO] - [train] Step 40200 out of 65536 | Loss --> 2.623 | Grad_l2 --> 0.417 | Weights_l2 --> 16197.469 | Lr --> 0.009 | Seconds_per_step --> 0.336 | [2024-07-27 11:00:07,237][Main][INFO] - [train] Step 40300 out of 65536 | Loss --> 2.621 | Grad_l2 --> 0.417 | Weights_l2 --> 16202.895 | Lr --> 0.009 | Seconds_per_step --> 0.333 | [2024-07-27 11:00:40,537][Main][INFO] - [train] Step 40400 out of 65536 | Loss --> 2.617 | Grad_l2 --> 0.429 | Weights_l2 --> 16208.160 | Lr --> 0.009 | Seconds_per_step --> 0.333 | [2024-07-27 11:01:13,827][Main][INFO] - [train] Step 40500 out of 65536 | Loss --> 2.594 | Grad_l2 --> 0.416 | Weights_l2 --> 16213.367 | Lr --> 0.008 | Seconds_per_step --> 0.333 | [2024-07-27 11:01:47,109][Main][INFO] - [train] Step 40600 out of 65536 | Loss --> 2.609 | Grad_l2 --> 0.412 | Weights_l2 --> 16218.407 | Lr --> 0.008 | Seconds_per_step --> 0.333 | [2024-07-27 11:02:20,396][Main][INFO] - [train] Step 40700 out of 65536 | Loss --> 2.593 | Grad_l2 --> 0.411 | Weights_l2 --> 16223.442 | Lr --> 0.008 | Seconds_per_step --> 0.333 | [2024-07-27 11:02:53,980][Main][INFO] - [train] Step 40800 out of 65536 | Loss --> 2.605 | Grad_l2 --> 0.414 | Weights_l2 --> 16228.494 | Lr --> 0.008 | Seconds_per_step --> 0.336 | [2024-07-27 11:03:27,285][Main][INFO] - [train] Step 40900 out of 65536 | Loss --> 2.599 | Grad_l2 --> 0.420 | Weights_l2 --> 16233.436 | Lr --> 0.008 | Seconds_per_step --> 0.333 | [2024-07-27 11:04:00,564][Main][INFO] - [train] Step 41000 out of 65536 | Loss --> 2.596 | Grad_l2 --> 0.421 | Weights_l2 --> 16238.292 | Lr --> 0.008 | Seconds_per_step --> 0.333 | [2024-07-27 11:04:33,856][Main][INFO] - [train] Step 41100 out of 65536 | Loss --> 2.601 | Grad_l2 --> 0.420 | Weights_l2 --> 16243.092 | Lr --> 0.008 | Seconds_per_step --> 0.333 | [2024-07-27 11:05:07,146][Main][INFO] - [train] Step 41200 out of 65536 | Loss --> 2.589 | Grad_l2 --> 0.417 | Weights_l2 --> 16247.846 | Lr --> 0.008 | Seconds_per_step --> 0.333 | [2024-07-27 11:05:40,458][Main][INFO] - [train] Step 41300 out of 65536 | Loss --> 2.609 | Grad_l2 --> 0.422 | Weights_l2 --> 16252.616 | Lr --> 0.008 | Seconds_per_step --> 0.333 | [2024-07-27 11:06:14,051][Main][INFO] - [train] Step 41400 out of 65536 | Loss --> 2.583 | Grad_l2 --> 0.416 | Weights_l2 --> 16257.148 | Lr --> 0.008 | Seconds_per_step --> 0.336 | [2024-07-27 11:06:47,594][Main][INFO] - [train] Step 41500 out of 65536 | Loss --> 2.585 | Grad_l2 --> 0.416 | Weights_l2 --> 16261.702 | Lr --> 0.008 | Seconds_per_step --> 0.335 | [2024-07-27 11:07:22,743][Main][INFO] - [train] Step 41600 out of 65536 | Loss --> 2.597 | Grad_l2 --> 0.420 | Weights_l2 --> 16266.108 | Lr --> 0.008 | Seconds_per_step --> 0.351 | [2024-07-27 11:07:56,285][Main][INFO] - [train] Step 41700 out of 65536 | Loss --> 2.587 | Grad_l2 --> 0.426 | Weights_l2 --> 16270.528 | Lr --> 0.008 | Seconds_per_step --> 0.335 | [2024-07-27 11:08:29,515][Main][INFO] - [train] Step 41800 out of 65536 | Loss --> 2.591 | Grad_l2 --> 0.427 | Weights_l2 --> 16274.937 | Lr --> 0.008 | Seconds_per_step --> 0.332 | [2024-07-27 11:09:02,742][Main][INFO] - [train] Step 41900 out of 65536 | Loss --> 2.590 | Grad_l2 --> 0.415 | Weights_l2 --> 16279.164 | Lr --> 0.008 | Seconds_per_step --> 0.332 | [2024-07-27 11:09:36,320][Main][INFO] - [train] Step 42000 out of 65536 | Loss --> 2.584 | Grad_l2 --> 0.419 | Weights_l2 --> 16283.385 | Lr --> 0.008 | Seconds_per_step --> 0.336 | [2024-07-27 11:10:09,635][Main][INFO] - [train] Step 42100 out of 65536 | Loss --> 2.592 | Grad_l2 --> 0.416 | Weights_l2 --> 16287.568 | Lr --> 0.008 | Seconds_per_step --> 0.333 | [2024-07-27 11:10:42,957][Main][INFO] - [train] Step 42200 out of 65536 | Loss --> 2.590 | Grad_l2 --> 0.472 | Weights_l2 --> 16291.667 | Lr --> 0.008 | Seconds_per_step --> 0.333 | [2024-07-27 11:11:16,276][Main][INFO] - [train] Step 42300 out of 65536 | Loss --> 2.586 | Grad_l2 --> 0.415 | Weights_l2 --> 16295.681 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:11:49,587][Main][INFO] - [train] Step 42400 out of 65536 | Loss --> 2.586 | Grad_l2 --> 0.425 | Weights_l2 --> 16299.649 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:12:22,899][Main][INFO] - [train] Step 42500 out of 65536 | Loss --> 2.587 | Grad_l2 --> 0.427 | Weights_l2 --> 16303.552 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:12:22,900][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-42500 [2024-07-27 11:12:22,902][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 11:12:24,562][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-42500/model.safetensors [2024-07-27 11:12:26,466][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-42500/optimizer.bin [2024-07-27 11:12:26,466][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-42500/scheduler.bin [2024-07-27 11:12:26,466][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-42500/sampler.bin [2024-07-27 11:12:26,466][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-42500/sampler_1.bin [2024-07-27 11:12:26,467][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-42500/random_states_0.pkl [2024-07-27 11:13:00,102][Main][INFO] - [train] Step 42600 out of 65536 | Loss --> 2.554 | Grad_l2 --> 0.411 | Weights_l2 --> 16307.351 | Lr --> 0.007 | Seconds_per_step --> 0.372 | [2024-07-27 11:13:33,390][Main][INFO] - [train] Step 42700 out of 65536 | Loss --> 2.562 | Grad_l2 --> 0.423 | Weights_l2 --> 16311.212 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:14:06,674][Main][INFO] - [train] Step 42800 out of 65536 | Loss --> 2.578 | Grad_l2 --> 0.413 | Weights_l2 --> 16314.935 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:14:39,969][Main][INFO] - [train] Step 42900 out of 65536 | Loss --> 2.579 | Grad_l2 --> 0.418 | Weights_l2 --> 16318.592 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:15:13,245][Main][INFO] - [train] Step 43000 out of 65536 | Loss --> 2.573 | Grad_l2 --> 0.416 | Weights_l2 --> 16322.187 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:15:46,542][Main][INFO] - [train] Step 43100 out of 65536 | Loss --> 2.569 | Grad_l2 --> 0.411 | Weights_l2 --> 16325.640 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:16:20,119][Main][INFO] - [train] Step 43200 out of 65536 | Loss --> 2.555 | Grad_l2 --> 0.417 | Weights_l2 --> 16329.087 | Lr --> 0.007 | Seconds_per_step --> 0.336 | [2024-07-27 11:16:53,407][Main][INFO] - [train] Step 43300 out of 65536 | Loss --> 2.560 | Grad_l2 --> 0.409 | Weights_l2 --> 16332.477 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:17:26,682][Main][INFO] - [train] Step 43400 out of 65536 | Loss --> 2.566 | Grad_l2 --> 0.412 | Weights_l2 --> 16335.856 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:17:59,967][Main][INFO] - [train] Step 43500 out of 65536 | Loss --> 2.556 | Grad_l2 --> 0.417 | Weights_l2 --> 16339.196 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:18:33,468][Main][INFO] - [train] Step 43600 out of 65536 | Loss --> 2.561 | Grad_l2 --> 0.420 | Weights_l2 --> 16342.444 | Lr --> 0.007 | Seconds_per_step --> 0.335 | [2024-07-27 11:19:08,408][Main][INFO] - [train] Step 43700 out of 65536 | Loss --> 2.554 | Grad_l2 --> 0.420 | Weights_l2 --> 16345.644 | Lr --> 0.007 | Seconds_per_step --> 0.349 | [2024-07-27 11:19:41,998][Main][INFO] - [train] Step 43800 out of 65536 | Loss --> 2.564 | Grad_l2 --> 0.419 | Weights_l2 --> 16348.774 | Lr --> 0.007 | Seconds_per_step --> 0.336 | [2024-07-27 11:20:15,272][Main][INFO] - [train] Step 43900 out of 65536 | Loss --> 2.522 | Grad_l2 --> 0.408 | Weights_l2 --> 16351.878 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:20:48,523][Main][INFO] - [train] Step 44000 out of 65536 | Loss --> 2.551 | Grad_l2 --> 0.417 | Weights_l2 --> 16354.947 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:21:21,812][Main][INFO] - [train] Step 44100 out of 65536 | Loss --> 2.546 | Grad_l2 --> 0.419 | Weights_l2 --> 16357.925 | Lr --> 0.007 | Seconds_per_step --> 0.333 | [2024-07-27 11:21:55,071][Main][INFO] - [train] Step 44200 out of 65536 | Loss --> 2.551 | Grad_l2 --> 0.418 | Weights_l2 --> 16360.852 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:22:28,350][Main][INFO] - [train] Step 44300 out of 65536 | Loss --> 2.538 | Grad_l2 --> 0.426 | Weights_l2 --> 16363.822 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:23:01,928][Main][INFO] - [train] Step 44400 out of 65536 | Loss --> 2.549 | Grad_l2 --> 0.422 | Weights_l2 --> 16366.645 | Lr --> 0.006 | Seconds_per_step --> 0.336 | [2024-07-27 11:23:35,215][Main][INFO] - [train] Step 44500 out of 65536 | Loss --> 2.540 | Grad_l2 --> 0.421 | Weights_l2 --> 16369.436 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:24:08,476][Main][INFO] - [train] Step 44600 out of 65536 | Loss --> 2.543 | Grad_l2 --> 0.412 | Weights_l2 --> 16372.098 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:24:41,789][Main][INFO] - [train] Step 44700 out of 65536 | Loss --> 2.551 | Grad_l2 --> 0.408 | Weights_l2 --> 16374.729 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:25:15,052][Main][INFO] - [train] Step 44800 out of 65536 | Loss --> 2.538 | Grad_l2 --> 0.407 | Weights_l2 --> 16377.333 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:25:48,379][Main][INFO] - [train] Step 44900 out of 65536 | Loss --> 2.538 | Grad_l2 --> 0.422 | Weights_l2 --> 16379.860 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:26:21,987][Main][INFO] - [train] Step 45000 out of 65536 | Loss --> 2.523 | Grad_l2 --> 0.408 | Weights_l2 --> 16382.387 | Lr --> 0.006 | Seconds_per_step --> 0.336 | [2024-07-27 11:26:21,987][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-45000 [2024-07-27 11:26:21,989][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 11:26:23,640][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-45000/model.safetensors [2024-07-27 11:26:25,438][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-45000/optimizer.bin [2024-07-27 11:26:25,438][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-45000/scheduler.bin [2024-07-27 11:26:25,438][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-45000/sampler.bin [2024-07-27 11:26:25,439][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-45000/sampler_1.bin [2024-07-27 11:26:25,439][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-45000/random_states_0.pkl [2024-07-27 11:26:58,791][Main][INFO] - [train] Step 45100 out of 65536 | Loss --> 2.542 | Grad_l2 --> 0.409 | Weights_l2 --> 16384.874 | Lr --> 0.006 | Seconds_per_step --> 0.368 | [2024-07-27 11:27:32,075][Main][INFO] - [train] Step 45200 out of 65536 | Loss --> 2.527 | Grad_l2 --> 0.418 | Weights_l2 --> 16387.332 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:28:05,390][Main][INFO] - [train] Step 45300 out of 65536 | Loss --> 2.536 | Grad_l2 --> 0.417 | Weights_l2 --> 16389.692 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:28:38,748][Main][INFO] - [train] Step 45400 out of 65536 | Loss --> 2.542 | Grad_l2 --> 0.415 | Weights_l2 --> 16391.992 | Lr --> 0.006 | Seconds_per_step --> 0.334 | [2024-07-27 11:29:12,015][Main][INFO] - [train] Step 45500 out of 65536 | Loss --> 2.535 | Grad_l2 --> 0.416 | Weights_l2 --> 16394.258 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:29:45,586][Main][INFO] - [train] Step 45600 out of 65536 | Loss --> 2.535 | Grad_l2 --> 0.402 | Weights_l2 --> 16396.441 | Lr --> 0.006 | Seconds_per_step --> 0.336 | [2024-07-27 11:30:18,894][Main][INFO] - [train] Step 45700 out of 65536 | Loss --> 2.513 | Grad_l2 --> 0.413 | Weights_l2 --> 16398.672 | Lr --> 0.006 | Seconds_per_step --> 0.333 | [2024-07-27 11:30:52,299][Main][INFO] - [train] Step 45800 out of 65536 | Loss --> 2.527 | Grad_l2 --> 0.420 | Weights_l2 --> 16400.828 | Lr --> 0.006 | Seconds_per_step --> 0.334 | [2024-07-27 11:31:25,939][Main][INFO] - [train] Step 45900 out of 65536 | Loss --> 2.536 | Grad_l2 --> 0.418 | Weights_l2 --> 16402.950 | Lr --> 0.006 | Seconds_per_step --> 0.336 | [2024-07-27 11:31:59,557][Main][INFO] - [train] Step 46000 out of 65536 | Loss --> 2.538 | Grad_l2 --> 0.413 | Weights_l2 --> 16404.985 | Lr --> 0.006 | Seconds_per_step --> 0.336 | [2024-07-27 11:32:32,891][Main][INFO] - [train] Step 46100 out of 65536 | Loss --> 2.528 | Grad_l2 --> 0.417 | Weights_l2 --> 16406.958 | Lr --> 0.005 | Seconds_per_step --> 0.333 | [2024-07-27 11:33:06,167][Main][INFO] - [train] Step 46200 out of 65536 | Loss --> 2.515 | Grad_l2 --> 0.410 | Weights_l2 --> 16408.985 | Lr --> 0.005 | Seconds_per_step --> 0.333 | [2024-07-27 11:33:39,831][Main][INFO] - [train] Step 46300 out of 65536 | Loss --> 2.514 | Grad_l2 --> 0.415 | Weights_l2 --> 16410.869 | Lr --> 0.005 | Seconds_per_step --> 0.337 | [2024-07-27 11:34:13,143][Main][INFO] - [train] Step 46400 out of 65536 | Loss --> 2.513 | Grad_l2 --> 0.412 | Weights_l2 --> 16412.772 | Lr --> 0.005 | Seconds_per_step --> 0.333 | [2024-07-27 11:34:46,442][Main][INFO] - [train] Step 46500 out of 65536 | Loss --> 2.514 | Grad_l2 --> 0.416 | Weights_l2 --> 16414.629 | Lr --> 0.005 | Seconds_per_step --> 0.333 | [2024-07-27 11:35:20,588][Main][INFO] - [train] Step 46600 out of 65536 | Loss --> 2.519 | Grad_l2 --> 0.421 | Weights_l2 --> 16416.407 | Lr --> 0.005 | Seconds_per_step --> 0.341 | [2024-07-27 11:35:55,060][Main][INFO] - [train] Step 46700 out of 65536 | Loss --> 2.514 | Grad_l2 --> 0.413 | Weights_l2 --> 16418.158 | Lr --> 0.005 | Seconds_per_step --> 0.345 | [2024-07-27 11:36:28,421][Main][INFO] - [train] Step 46800 out of 65536 | Loss --> 2.506 | Grad_l2 --> 0.406 | Weights_l2 --> 16419.844 | Lr --> 0.005 | Seconds_per_step --> 0.334 | [2024-07-27 11:37:02,023][Main][INFO] - [train] Step 46900 out of 65536 | Loss --> 2.511 | Grad_l2 --> 0.414 | Weights_l2 --> 16421.556 | Lr --> 0.005 | Seconds_per_step --> 0.336 | [2024-07-27 11:37:35,311][Main][INFO] - [train] Step 47000 out of 65536 | Loss --> 2.500 | Grad_l2 --> 0.414 | Weights_l2 --> 16423.165 | Lr --> 0.005 | Seconds_per_step --> 0.333 | [2024-07-27 11:38:08,612][Main][INFO] - [train] Step 47100 out of 65536 | Loss --> 2.498 | Grad_l2 --> 0.405 | Weights_l2 --> 16424.769 | Lr --> 0.005 | Seconds_per_step --> 0.333 | [2024-07-27 11:38:41,918][Main][INFO] - [train] Step 47200 out of 65536 | Loss --> 2.499 | Grad_l2 --> 0.411 | Weights_l2 --> 16426.327 | Lr --> 0.005 | Seconds_per_step --> 0.333 | [2024-07-27 11:39:16,877][Main][INFO] - [train] Step 47300 out of 65536 | Loss --> 2.494 | Grad_l2 --> 0.412 | Weights_l2 --> 16427.798 | Lr --> 0.005 | Seconds_per_step --> 0.350 | [2024-07-27 11:39:51,210][Main][INFO] - [train] Step 47400 out of 65536 | Loss --> 2.518 | Grad_l2 --> 0.410 | Weights_l2 --> 16429.279 | Lr --> 0.005 | Seconds_per_step --> 0.343 | [2024-07-27 11:40:25,037][Main][INFO] - [train] Step 47500 out of 65536 | Loss --> 2.492 | Grad_l2 --> 0.409 | Weights_l2 --> 16430.749 | Lr --> 0.005 | Seconds_per_step --> 0.338 | [2024-07-27 11:40:25,037][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-47500 [2024-07-27 11:40:25,039][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 11:40:26,717][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-47500/model.safetensors [2024-07-27 11:40:28,618][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-47500/optimizer.bin [2024-07-27 11:40:28,619][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-47500/scheduler.bin [2024-07-27 11:40:28,619][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-47500/sampler.bin [2024-07-27 11:40:28,619][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-47500/sampler_1.bin [2024-07-27 11:40:28,619][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-47500/random_states_0.pkl [2024-07-27 11:41:02,009][Main][INFO] - [train] Step 47600 out of 65536 | Loss --> 2.500 | Grad_l2 --> 0.410 | Weights_l2 --> 16432.144 | Lr --> 0.005 | Seconds_per_step --> 0.370 | [2024-07-27 11:41:35,296][Main][INFO] - [train] Step 47700 out of 65536 | Loss --> 2.488 | Grad_l2 --> 0.412 | Weights_l2 --> 16433.601 | Lr --> 0.005 | Seconds_per_step --> 0.333 | [2024-07-27 11:42:08,641][Main][INFO] - [train] Step 47800 out of 65536 | Loss --> 2.491 | Grad_l2 --> 0.406 | Weights_l2 --> 16434.955 | Lr --> 0.005 | Seconds_per_step --> 0.333 | [2024-07-27 11:42:41,987][Main][INFO] - [train] Step 47900 out of 65536 | Loss --> 2.494 | Grad_l2 --> 0.411 | Weights_l2 --> 16436.267 | Lr --> 0.005 | Seconds_per_step --> 0.333 | [2024-07-27 11:43:15,645][Main][INFO] - [train] Step 48000 out of 65536 | Loss --> 2.496 | Grad_l2 --> 0.411 | Weights_l2 --> 16437.504 | Lr --> 0.005 | Seconds_per_step --> 0.337 | [2024-07-27 11:43:49,411][Main][INFO] - [train] Step 48100 out of 65536 | Loss --> 2.495 | Grad_l2 --> 0.407 | Weights_l2 --> 16438.766 | Lr --> 0.004 | Seconds_per_step --> 0.338 | [2024-07-27 11:44:22,765][Main][INFO] - [train] Step 48200 out of 65536 | Loss --> 2.499 | Grad_l2 --> 0.411 | Weights_l2 --> 16439.988 | Lr --> 0.004 | Seconds_per_step --> 0.334 | [2024-07-27 11:44:56,103][Main][INFO] - [train] Step 48300 out of 65536 | Loss --> 2.495 | Grad_l2 --> 0.413 | Weights_l2 --> 16441.132 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:45:29,424][Main][INFO] - [train] Step 48400 out of 65536 | Loss --> 2.496 | Grad_l2 --> 0.408 | Weights_l2 --> 16442.290 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:46:02,730][Main][INFO] - [train] Step 48500 out of 65536 | Loss --> 2.493 | Grad_l2 --> 0.408 | Weights_l2 --> 16443.452 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:46:36,009][Main][INFO] - [train] Step 48600 out of 65536 | Loss --> 2.499 | Grad_l2 --> 0.408 | Weights_l2 --> 16444.535 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:47:10,924][Main][INFO] - [train] Step 48700 out of 65536 | Loss --> 2.484 | Grad_l2 --> 0.409 | Weights_l2 --> 16445.592 | Lr --> 0.004 | Seconds_per_step --> 0.349 | [2024-07-27 11:47:44,211][Main][INFO] - [train] Step 48800 out of 65536 | Loss --> 2.485 | Grad_l2 --> 0.408 | Weights_l2 --> 16446.629 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:48:17,481][Main][INFO] - [train] Step 48900 out of 65536 | Loss --> 2.484 | Grad_l2 --> 0.406 | Weights_l2 --> 16447.587 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:48:50,771][Main][INFO] - [train] Step 49000 out of 65536 | Loss --> 2.476 | Grad_l2 --> 0.401 | Weights_l2 --> 16448.551 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:49:24,122][Main][INFO] - [train] Step 49100 out of 65536 | Loss --> 2.476 | Grad_l2 --> 0.409 | Weights_l2 --> 16449.499 | Lr --> 0.004 | Seconds_per_step --> 0.334 | [2024-07-27 11:49:57,454][Main][INFO] - [train] Step 49200 out of 65536 | Loss --> 2.472 | Grad_l2 --> 0.409 | Weights_l2 --> 16450.437 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:50:31,124][Main][INFO] - [train] Step 49300 out of 65536 | Loss --> 2.479 | Grad_l2 --> 0.411 | Weights_l2 --> 16451.328 | Lr --> 0.004 | Seconds_per_step --> 0.337 | [2024-07-27 11:51:04,566][Main][INFO] - [train] Step 49400 out of 65536 | Loss --> 2.489 | Grad_l2 --> 0.412 | Weights_l2 --> 16452.224 | Lr --> 0.004 | Seconds_per_step --> 0.334 | [2024-07-27 11:51:37,893][Main][INFO] - [train] Step 49500 out of 65536 | Loss --> 2.478 | Grad_l2 --> 0.415 | Weights_l2 --> 16453.046 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:52:13,307][Main][INFO] - [train] Step 49600 out of 65536 | Loss --> 2.469 | Grad_l2 --> 0.406 | Weights_l2 --> 16453.878 | Lr --> 0.004 | Seconds_per_step --> 0.354 | [2024-07-27 11:52:47,302][Main][INFO] - [train] Step 49700 out of 65536 | Loss --> 2.473 | Grad_l2 --> 0.415 | Weights_l2 --> 16454.635 | Lr --> 0.004 | Seconds_per_step --> 0.340 | [2024-07-27 11:53:20,595][Main][INFO] - [train] Step 49800 out of 65536 | Loss --> 2.476 | Grad_l2 --> 0.410 | Weights_l2 --> 16455.410 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:53:54,160][Main][INFO] - [train] Step 49900 out of 65536 | Loss --> 2.458 | Grad_l2 --> 0.405 | Weights_l2 --> 16456.183 | Lr --> 0.004 | Seconds_per_step --> 0.336 | [2024-07-27 11:54:27,426][Main][INFO] - [train] Step 50000 out of 65536 | Loss --> 2.478 | Grad_l2 --> 0.408 | Weights_l2 --> 16456.881 | Lr --> 0.004 | Seconds_per_step --> 0.333 | [2024-07-27 11:54:27,426][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-50000 [2024-07-27 11:54:27,428][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 11:54:29,079][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-50000/model.safetensors [2024-07-27 11:54:30,981][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-50000/optimizer.bin [2024-07-27 11:54:30,981][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-50000/scheduler.bin [2024-07-27 11:54:30,981][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-50000/sampler.bin [2024-07-27 11:54:30,981][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-50000/sampler_1.bin [2024-07-27 11:54:30,982][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-50000/random_states_0.pkl [2024-07-27 11:55:04,486][Main][INFO] - [train] Step 50100 out of 65536 | Loss --> 2.464 | Grad_l2 --> 0.415 | Weights_l2 --> 16457.590 | Lr --> 0.004 | Seconds_per_step --> 0.371 | [2024-07-27 11:55:37,855][Main][INFO] - [train] Step 50200 out of 65536 | Loss --> 2.468 | Grad_l2 --> 0.403 | Weights_l2 --> 16458.237 | Lr --> 0.004 | Seconds_per_step --> 0.334 | [2024-07-27 11:56:11,139][Main][INFO] - [train] Step 50300 out of 65536 | Loss --> 2.470 | Grad_l2 --> 0.403 | Weights_l2 --> 16458.913 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 11:56:44,421][Main][INFO] - [train] Step 50400 out of 65536 | Loss --> 2.463 | Grad_l2 --> 0.401 | Weights_l2 --> 16459.534 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 11:57:18,022][Main][INFO] - [train] Step 50500 out of 65536 | Loss --> 2.442 | Grad_l2 --> 0.406 | Weights_l2 --> 16460.125 | Lr --> 0.003 | Seconds_per_step --> 0.336 | [2024-07-27 11:57:51,315][Main][INFO] - [train] Step 50600 out of 65536 | Loss --> 2.449 | Grad_l2 --> 0.409 | Weights_l2 --> 16460.691 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 11:58:24,615][Main][INFO] - [train] Step 50700 out of 65536 | Loss --> 2.457 | Grad_l2 --> 0.414 | Weights_l2 --> 16461.228 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 11:58:57,899][Main][INFO] - [train] Step 50800 out of 65536 | Loss --> 2.447 | Grad_l2 --> 0.411 | Weights_l2 --> 16461.795 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 11:59:31,174][Main][INFO] - [train] Step 50900 out of 65536 | Loss --> 2.455 | Grad_l2 --> 0.400 | Weights_l2 --> 16462.301 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:00:04,449][Main][INFO] - [train] Step 51000 out of 65536 | Loss --> 2.465 | Grad_l2 --> 0.406 | Weights_l2 --> 16462.777 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:00:38,024][Main][INFO] - [train] Step 51100 out of 65536 | Loss --> 2.446 | Grad_l2 --> 0.406 | Weights_l2 --> 16463.233 | Lr --> 0.003 | Seconds_per_step --> 0.336 | [2024-07-27 12:01:11,322][Main][INFO] - [train] Step 51200 out of 65536 | Loss --> 2.456 | Grad_l2 --> 0.414 | Weights_l2 --> 16463.675 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:01:44,623][Main][INFO] - [train] Step 51300 out of 65536 | Loss --> 2.446 | Grad_l2 --> 0.402 | Weights_l2 --> 16464.130 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:02:17,892][Main][INFO] - [train] Step 51400 out of 65536 | Loss --> 2.451 | Grad_l2 --> 0.401 | Weights_l2 --> 16464.558 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:02:51,162][Main][INFO] - [train] Step 51500 out of 65536 | Loss --> 2.445 | Grad_l2 --> 0.407 | Weights_l2 --> 16464.906 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:03:24,440][Main][INFO] - [train] Step 51600 out of 65536 | Loss --> 2.450 | Grad_l2 --> 0.407 | Weights_l2 --> 16465.307 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:03:57,711][Main][INFO] - [train] Step 51700 out of 65536 | Loss --> 2.458 | Grad_l2 --> 0.408 | Weights_l2 --> 16465.650 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:04:31,281][Main][INFO] - [train] Step 51800 out of 65536 | Loss --> 2.445 | Grad_l2 --> 0.403 | Weights_l2 --> 16465.983 | Lr --> 0.003 | Seconds_per_step --> 0.336 | [2024-07-27 12:05:04,570][Main][INFO] - [train] Step 51900 out of 65536 | Loss --> 2.446 | Grad_l2 --> 0.410 | Weights_l2 --> 16466.327 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:05:37,868][Main][INFO] - [train] Step 52000 out of 65536 | Loss --> 2.446 | Grad_l2 --> 0.403 | Weights_l2 --> 16466.635 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:06:11,136][Main][INFO] - [train] Step 52100 out of 65536 | Loss --> 2.442 | Grad_l2 --> 0.408 | Weights_l2 --> 16466.928 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:06:44,429][Main][INFO] - [train] Step 52200 out of 65536 | Loss --> 2.441 | Grad_l2 --> 0.408 | Weights_l2 --> 16467.217 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:07:17,701][Main][INFO] - [train] Step 52300 out of 65536 | Loss --> 2.448 | Grad_l2 --> 0.398 | Weights_l2 --> 16467.514 | Lr --> 0.003 | Seconds_per_step --> 0.333 | [2024-07-27 12:07:51,264][Main][INFO] - [train] Step 52400 out of 65536 | Loss --> 2.449 | Grad_l2 --> 0.407 | Weights_l2 --> 16467.782 | Lr --> 0.003 | Seconds_per_step --> 0.336 | [2024-07-27 12:08:24,744][Main][INFO] - [train] Step 52500 out of 65536 | Loss --> 2.440 | Grad_l2 --> 0.407 | Weights_l2 --> 16468.008 | Lr --> 0.003 | Seconds_per_step --> 0.335 | [2024-07-27 12:08:24,745][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-52500 [2024-07-27 12:08:24,747][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 12:08:26,410][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-52500/model.safetensors [2024-07-27 12:08:28,255][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-52500/optimizer.bin [2024-07-27 12:08:28,255][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-52500/scheduler.bin [2024-07-27 12:08:28,255][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-52500/sampler.bin [2024-07-27 12:08:28,255][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-52500/sampler_1.bin [2024-07-27 12:08:28,256][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-52500/random_states_0.pkl [2024-07-27 12:09:03,306][Main][INFO] - [train] Step 52600 out of 65536 | Loss --> 2.461 | Grad_l2 --> 0.406 | Weights_l2 --> 16468.224 | Lr --> 0.003 | Seconds_per_step --> 0.386 | [2024-07-27 12:09:37,320][Main][INFO] - [train] Step 52700 out of 65536 | Loss --> 2.439 | Grad_l2 --> 0.405 | Weights_l2 --> 16468.458 | Lr --> 0.003 | Seconds_per_step --> 0.340 | [2024-07-27 12:10:10,573][Main][INFO] - [train] Step 52800 out of 65536 | Loss --> 2.439 | Grad_l2 --> 0.412 | Weights_l2 --> 16468.660 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:10:43,826][Main][INFO] - [train] Step 52900 out of 65536 | Loss --> 2.434 | Grad_l2 --> 0.403 | Weights_l2 --> 16468.872 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:11:17,418][Main][INFO] - [train] Step 53000 out of 65536 | Loss --> 2.436 | Grad_l2 --> 0.401 | Weights_l2 --> 16469.048 | Lr --> 0.002 | Seconds_per_step --> 0.336 | [2024-07-27 12:11:50,976][Main][INFO] - [train] Step 53100 out of 65536 | Loss --> 2.438 | Grad_l2 --> 0.403 | Weights_l2 --> 16469.210 | Lr --> 0.002 | Seconds_per_step --> 0.336 | [2024-07-27 12:12:24,453][Main][INFO] - [train] Step 53200 out of 65536 | Loss --> 2.427 | Grad_l2 --> 0.409 | Weights_l2 --> 16469.359 | Lr --> 0.002 | Seconds_per_step --> 0.335 | [2024-07-27 12:12:57,732][Main][INFO] - [train] Step 53300 out of 65536 | Loss --> 2.425 | Grad_l2 --> 0.405 | Weights_l2 --> 16469.514 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:13:31,022][Main][INFO] - [train] Step 53400 out of 65536 | Loss --> 2.436 | Grad_l2 --> 0.399 | Weights_l2 --> 16469.677 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:14:04,297][Main][INFO] - [train] Step 53500 out of 65536 | Loss --> 2.443 | Grad_l2 --> 0.410 | Weights_l2 --> 16469.801 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:14:37,897][Main][INFO] - [train] Step 53600 out of 65536 | Loss --> 2.435 | Grad_l2 --> 0.400 | Weights_l2 --> 16469.912 | Lr --> 0.002 | Seconds_per_step --> 0.336 | [2024-07-27 12:15:11,191][Main][INFO] - [train] Step 53700 out of 65536 | Loss --> 2.441 | Grad_l2 --> 0.407 | Weights_l2 --> 16470.016 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:15:44,471][Main][INFO] - [train] Step 53800 out of 65536 | Loss --> 2.423 | Grad_l2 --> 0.407 | Weights_l2 --> 16470.105 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:16:17,730][Main][INFO] - [train] Step 53900 out of 65536 | Loss --> 2.436 | Grad_l2 --> 0.407 | Weights_l2 --> 16470.213 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:16:51,004][Main][INFO] - [train] Step 54000 out of 65536 | Loss --> 2.420 | Grad_l2 --> 0.400 | Weights_l2 --> 16470.284 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:17:24,261][Main][INFO] - [train] Step 54100 out of 65536 | Loss --> 2.418 | Grad_l2 --> 0.405 | Weights_l2 --> 16470.347 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:17:57,829][Main][INFO] - [train] Step 54200 out of 65536 | Loss --> 2.428 | Grad_l2 --> 0.401 | Weights_l2 --> 16470.408 | Lr --> 0.002 | Seconds_per_step --> 0.336 | [2024-07-27 12:18:31,089][Main][INFO] - [train] Step 54300 out of 65536 | Loss --> 2.428 | Grad_l2 --> 0.407 | Weights_l2 --> 16470.477 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:19:04,354][Main][INFO] - [train] Step 54400 out of 65536 | Loss --> 2.427 | Grad_l2 --> 0.403 | Weights_l2 --> 16470.522 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:19:37,608][Main][INFO] - [train] Step 54500 out of 65536 | Loss --> 2.411 | Grad_l2 --> 0.408 | Weights_l2 --> 16470.545 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:20:10,874][Main][INFO] - [train] Step 54600 out of 65536 | Loss --> 2.427 | Grad_l2 --> 0.406 | Weights_l2 --> 16470.578 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:20:44,156][Main][INFO] - [train] Step 54700 out of 65536 | Loss --> 2.414 | Grad_l2 --> 0.407 | Weights_l2 --> 16470.631 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:21:17,741][Main][INFO] - [train] Step 54800 out of 65536 | Loss --> 2.414 | Grad_l2 --> 0.403 | Weights_l2 --> 16470.666 | Lr --> 0.002 | Seconds_per_step --> 0.336 | [2024-07-27 12:21:51,011][Main][INFO] - [train] Step 54900 out of 65536 | Loss --> 2.425 | Grad_l2 --> 0.401 | Weights_l2 --> 16470.677 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:22:24,533][Main][INFO] - [train] Step 55000 out of 65536 | Loss --> 2.408 | Grad_l2 --> 0.406 | Weights_l2 --> 16470.670 | Lr --> 0.002 | Seconds_per_step --> 0.335 | [2024-07-27 12:22:24,534][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-55000 [2024-07-27 12:22:24,536][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 12:22:26,187][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-55000/model.safetensors [2024-07-27 12:22:28,337][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-55000/optimizer.bin [2024-07-27 12:22:28,338][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-55000/scheduler.bin [2024-07-27 12:22:28,338][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-55000/sampler.bin [2024-07-27 12:22:28,338][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-55000/sampler_1.bin [2024-07-27 12:22:28,339][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-55000/random_states_0.pkl [2024-07-27 12:23:01,829][Main][INFO] - [train] Step 55100 out of 65536 | Loss --> 2.419 | Grad_l2 --> 0.400 | Weights_l2 --> 16470.702 | Lr --> 0.002 | Seconds_per_step --> 0.373 | [2024-07-27 12:23:35,358][Main][INFO] - [train] Step 55200 out of 65536 | Loss --> 2.410 | Grad_l2 --> 0.406 | Weights_l2 --> 16470.680 | Lr --> 0.002 | Seconds_per_step --> 0.335 | [2024-07-27 12:24:08,646][Main][INFO] - [train] Step 55300 out of 65536 | Loss --> 2.428 | Grad_l2 --> 0.407 | Weights_l2 --> 16470.670 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:24:42,681][Main][INFO] - [train] Step 55400 out of 65536 | Loss --> 2.410 | Grad_l2 --> 0.403 | Weights_l2 --> 16470.660 | Lr --> 0.002 | Seconds_per_step --> 0.340 | [2024-07-27 12:25:16,233][Main][INFO] - [train] Step 55500 out of 65536 | Loss --> 2.408 | Grad_l2 --> 0.405 | Weights_l2 --> 16470.625 | Lr --> 0.002 | Seconds_per_step --> 0.336 | [2024-07-27 12:25:49,501][Main][INFO] - [train] Step 55600 out of 65536 | Loss --> 2.396 | Grad_l2 --> 0.405 | Weights_l2 --> 16470.587 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:26:22,783][Main][INFO] - [train] Step 55700 out of 65536 | Loss --> 2.404 | Grad_l2 --> 0.410 | Weights_l2 --> 16470.566 | Lr --> 0.002 | Seconds_per_step --> 0.333 | [2024-07-27 12:26:56,314][Main][INFO] - [train] Step 55800 out of 65536 | Loss --> 2.401 | Grad_l2 --> 0.401 | Weights_l2 --> 16470.554 | Lr --> 0.001 | Seconds_per_step --> 0.335 | [2024-07-27 12:27:30,206][Main][INFO] - [train] Step 55900 out of 65536 | Loss --> 2.423 | Grad_l2 --> 0.402 | Weights_l2 --> 16470.526 | Lr --> 0.001 | Seconds_per_step --> 0.339 | [2024-07-27 12:28:03,926][Main][INFO] - [train] Step 56000 out of 65536 | Loss --> 2.415 | Grad_l2 --> 0.410 | Weights_l2 --> 16470.460 | Lr --> 0.001 | Seconds_per_step --> 0.337 | [2024-07-27 12:28:37,185][Main][INFO] - [train] Step 56100 out of 65536 | Loss --> 2.418 | Grad_l2 --> 0.401 | Weights_l2 --> 16470.431 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:29:10,455][Main][INFO] - [train] Step 56200 out of 65536 | Loss --> 2.408 | Grad_l2 --> 0.409 | Weights_l2 --> 16470.388 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:29:43,740][Main][INFO] - [train] Step 56300 out of 65536 | Loss --> 2.415 | Grad_l2 --> 0.405 | Weights_l2 --> 16470.326 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:30:17,014][Main][INFO] - [train] Step 56400 out of 65536 | Loss --> 2.392 | Grad_l2 --> 0.407 | Weights_l2 --> 16470.276 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:30:50,291][Main][INFO] - [train] Step 56500 out of 65536 | Loss --> 2.409 | Grad_l2 --> 0.407 | Weights_l2 --> 16470.226 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:31:23,861][Main][INFO] - [train] Step 56600 out of 65536 | Loss --> 2.399 | Grad_l2 --> 0.405 | Weights_l2 --> 16470.164 | Lr --> 0.001 | Seconds_per_step --> 0.336 | [2024-07-27 12:31:57,137][Main][INFO] - [train] Step 56700 out of 65536 | Loss --> 2.418 | Grad_l2 --> 0.413 | Weights_l2 --> 16470.107 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:32:31,468][Main][INFO] - [train] Step 56800 out of 65536 | Loss --> 2.413 | Grad_l2 --> 0.410 | Weights_l2 --> 16470.042 | Lr --> 0.001 | Seconds_per_step --> 0.343 | [2024-07-27 12:33:04,721][Main][INFO] - [train] Step 56900 out of 65536 | Loss --> 2.418 | Grad_l2 --> 0.408 | Weights_l2 --> 16469.966 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:33:37,980][Main][INFO] - [train] Step 57000 out of 65536 | Loss --> 2.395 | Grad_l2 --> 0.410 | Weights_l2 --> 16469.901 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:34:11,266][Main][INFO] - [train] Step 57100 out of 65536 | Loss --> 2.393 | Grad_l2 --> 0.409 | Weights_l2 --> 16469.827 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:34:44,838][Main][INFO] - [train] Step 57200 out of 65536 | Loss --> 2.408 | Grad_l2 --> 0.415 | Weights_l2 --> 16469.754 | Lr --> 0.001 | Seconds_per_step --> 0.336 | [2024-07-27 12:35:18,128][Main][INFO] - [train] Step 57300 out of 65536 | Loss --> 2.421 | Grad_l2 --> 0.418 | Weights_l2 --> 16469.676 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:35:51,449][Main][INFO] - [train] Step 57400 out of 65536 | Loss --> 2.397 | Grad_l2 --> 0.408 | Weights_l2 --> 16469.631 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:36:25,557][Main][INFO] - [train] Step 57500 out of 65536 | Loss --> 2.401 | Grad_l2 --> 0.413 | Weights_l2 --> 16469.573 | Lr --> 0.001 | Seconds_per_step --> 0.341 | [2024-07-27 12:36:25,557][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-57500 [2024-07-27 12:36:25,559][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 12:36:27,186][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-57500/model.safetensors [2024-07-27 12:36:29,073][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-57500/optimizer.bin [2024-07-27 12:36:29,073][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-57500/scheduler.bin [2024-07-27 12:36:29,074][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-57500/sampler.bin [2024-07-27 12:36:29,074][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-57500/sampler_1.bin [2024-07-27 12:36:29,074][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-57500/random_states_0.pkl [2024-07-27 12:37:02,396][Main][INFO] - [train] Step 57600 out of 65536 | Loss --> 2.409 | Grad_l2 --> 0.411 | Weights_l2 --> 16469.495 | Lr --> 0.001 | Seconds_per_step --> 0.368 | [2024-07-27 12:37:35,952][Main][INFO] - [train] Step 57700 out of 65536 | Loss --> 2.404 | Grad_l2 --> 0.417 | Weights_l2 --> 16469.414 | Lr --> 0.001 | Seconds_per_step --> 0.336 | [2024-07-27 12:38:09,510][Main][INFO] - [train] Step 57800 out of 65536 | Loss --> 2.395 | Grad_l2 --> 0.417 | Weights_l2 --> 16469.371 | Lr --> 0.001 | Seconds_per_step --> 0.336 | [2024-07-27 12:38:43,115][Main][INFO] - [train] Step 57900 out of 65536 | Loss --> 2.416 | Grad_l2 --> 0.413 | Weights_l2 --> 16469.279 | Lr --> 0.001 | Seconds_per_step --> 0.336 | [2024-07-27 12:39:16,474][Main][INFO] - [train] Step 58000 out of 65536 | Loss --> 2.375 | Grad_l2 --> 0.420 | Weights_l2 --> 16469.210 | Lr --> 0.001 | Seconds_per_step --> 0.334 | [2024-07-27 12:39:49,778][Main][INFO] - [train] Step 58100 out of 65536 | Loss --> 2.408 | Grad_l2 --> 0.424 | Weights_l2 --> 16469.129 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:40:23,611][Main][INFO] - [train] Step 58200 out of 65536 | Loss --> 2.399 | Grad_l2 --> 0.421 | Weights_l2 --> 16469.064 | Lr --> 0.001 | Seconds_per_step --> 0.338 | [2024-07-27 12:40:56,889][Main][INFO] - [train] Step 58300 out of 65536 | Loss --> 2.395 | Grad_l2 --> 0.422 | Weights_l2 --> 16468.993 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:41:30,238][Main][INFO] - [train] Step 58400 out of 65536 | Loss --> 2.397 | Grad_l2 --> 0.430 | Weights_l2 --> 16468.922 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:42:03,813][Main][INFO] - [train] Step 58500 out of 65536 | Loss --> 2.401 | Grad_l2 --> 0.428 | Weights_l2 --> 16468.843 | Lr --> 0.001 | Seconds_per_step --> 0.336 | [2024-07-27 12:42:38,242][Main][INFO] - [train] Step 58600 out of 65536 | Loss --> 2.415 | Grad_l2 --> 0.434 | Weights_l2 --> 16468.776 | Lr --> 0.001 | Seconds_per_step --> 0.344 | [2024-07-27 12:43:11,790][Main][INFO] - [train] Step 58700 out of 65536 | Loss --> 2.402 | Grad_l2 --> 0.433 | Weights_l2 --> 16468.692 | Lr --> 0.001 | Seconds_per_step --> 0.335 | [2024-07-27 12:43:45,158][Main][INFO] - [train] Step 58800 out of 65536 | Loss --> 2.384 | Grad_l2 --> 0.438 | Weights_l2 --> 16468.634 | Lr --> 0.001 | Seconds_per_step --> 0.334 | [2024-07-27 12:44:19,803][Main][INFO] - [train] Step 58900 out of 65536 | Loss --> 2.383 | Grad_l2 --> 0.442 | Weights_l2 --> 16468.567 | Lr --> 0.001 | Seconds_per_step --> 0.346 | [2024-07-27 12:44:53,383][Main][INFO] - [train] Step 59000 out of 65536 | Loss --> 2.382 | Grad_l2 --> 0.442 | Weights_l2 --> 16468.495 | Lr --> 0.001 | Seconds_per_step --> 0.336 | [2024-07-27 12:45:27,103][Main][INFO] - [train] Step 59100 out of 65536 | Loss --> 2.402 | Grad_l2 --> 0.448 | Weights_l2 --> 16468.422 | Lr --> 0.001 | Seconds_per_step --> 0.337 | [2024-07-27 12:46:00,906][Main][INFO] - [train] Step 59200 out of 65536 | Loss --> 2.410 | Grad_l2 --> 0.444 | Weights_l2 --> 16468.361 | Lr --> 0.001 | Seconds_per_step --> 0.338 | [2024-07-27 12:46:34,177][Main][INFO] - [train] Step 59300 out of 65536 | Loss --> 2.398 | Grad_l2 --> 0.447 | Weights_l2 --> 16468.289 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:47:07,733][Main][INFO] - [train] Step 59400 out of 65536 | Loss --> 2.390 | Grad_l2 --> 0.451 | Weights_l2 --> 16468.233 | Lr --> 0.001 | Seconds_per_step --> 0.336 | [2024-07-27 12:47:41,241][Main][INFO] - [train] Step 59500 out of 65536 | Loss --> 2.390 | Grad_l2 --> 0.449 | Weights_l2 --> 16468.165 | Lr --> 0.001 | Seconds_per_step --> 0.335 | [2024-07-27 12:48:14,574][Main][INFO] - [train] Step 59600 out of 65536 | Loss --> 2.403 | Grad_l2 --> 0.457 | Weights_l2 --> 16468.104 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:48:48,167][Main][INFO] - [train] Step 59700 out of 65536 | Loss --> 2.410 | Grad_l2 --> 0.458 | Weights_l2 --> 16468.047 | Lr --> 0.001 | Seconds_per_step --> 0.336 | [2024-07-27 12:49:21,475][Main][INFO] - [train] Step 59800 out of 65536 | Loss --> 2.400 | Grad_l2 --> 0.466 | Weights_l2 --> 16467.990 | Lr --> 0.001 | Seconds_per_step --> 0.333 | [2024-07-27 12:49:55,871][Main][INFO] - [train] Step 59900 out of 65536 | Loss --> 2.395 | Grad_l2 --> 0.465 | Weights_l2 --> 16467.938 | Lr --> 0.001 | Seconds_per_step --> 0.344 | [2024-07-27 12:50:29,651][Main][INFO] - [train] Step 60000 out of 65536 | Loss --> 2.385 | Grad_l2 --> 0.461 | Weights_l2 --> 16467.886 | Lr --> 0.000 | Seconds_per_step --> 0.338 | [2024-07-27 12:50:29,652][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-60000 [2024-07-27 12:50:29,654][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 12:50:31,329][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-60000/model.safetensors [2024-07-27 12:50:33,141][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-60000/optimizer.bin [2024-07-27 12:50:33,141][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-60000/scheduler.bin [2024-07-27 12:50:33,141][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-60000/sampler.bin [2024-07-27 12:50:33,141][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-60000/sampler_1.bin [2024-07-27 12:50:33,142][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-60000/random_states_0.pkl [2024-07-27 12:51:06,443][Main][INFO] - [train] Step 60100 out of 65536 | Loss --> 2.398 | Grad_l2 --> 0.472 | Weights_l2 --> 16467.837 | Lr --> 0.000 | Seconds_per_step --> 0.368 | [2024-07-27 12:51:39,717][Main][INFO] - [train] Step 60200 out of 65536 | Loss --> 2.390 | Grad_l2 --> 0.471 | Weights_l2 --> 16467.784 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 12:52:13,345][Main][INFO] - [train] Step 60300 out of 65536 | Loss --> 2.413 | Grad_l2 --> 0.475 | Weights_l2 --> 16467.733 | Lr --> 0.000 | Seconds_per_step --> 0.336 | [2024-07-27 12:52:46,822][Main][INFO] - [train] Step 60400 out of 65536 | Loss --> 2.394 | Grad_l2 --> 0.479 | Weights_l2 --> 16467.684 | Lr --> 0.000 | Seconds_per_step --> 0.335 | [2024-07-27 12:53:20,254][Main][INFO] - [train] Step 60500 out of 65536 | Loss --> 2.402 | Grad_l2 --> 0.476 | Weights_l2 --> 16467.627 | Lr --> 0.000 | Seconds_per_step --> 0.334 | [2024-07-27 12:53:53,563][Main][INFO] - [train] Step 60600 out of 65536 | Loss --> 2.388 | Grad_l2 --> 0.487 | Weights_l2 --> 16467.582 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 12:54:26,861][Main][INFO] - [train] Step 60700 out of 65536 | Loss --> 2.400 | Grad_l2 --> 0.482 | Weights_l2 --> 16467.541 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 12:55:00,285][Main][INFO] - [train] Step 60800 out of 65536 | Loss --> 2.403 | Grad_l2 --> 0.485 | Weights_l2 --> 16467.497 | Lr --> 0.000 | Seconds_per_step --> 0.334 | [2024-07-27 12:55:33,993][Main][INFO] - [train] Step 60900 out of 65536 | Loss --> 2.402 | Grad_l2 --> 0.486 | Weights_l2 --> 16467.456 | Lr --> 0.000 | Seconds_per_step --> 0.337 | [2024-07-27 12:56:07,287][Main][INFO] - [train] Step 61000 out of 65536 | Loss --> 2.405 | Grad_l2 --> 0.493 | Weights_l2 --> 16467.414 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 12:56:40,573][Main][INFO] - [train] Step 61100 out of 65536 | Loss --> 2.411 | Grad_l2 --> 0.496 | Weights_l2 --> 16467.380 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 12:57:14,106][Main][INFO] - [train] Step 61200 out of 65536 | Loss --> 2.390 | Grad_l2 --> 0.501 | Weights_l2 --> 16467.343 | Lr --> 0.000 | Seconds_per_step --> 0.335 | [2024-07-27 12:57:48,048][Main][INFO] - [train] Step 61300 out of 65536 | Loss --> 2.398 | Grad_l2 --> 0.503 | Weights_l2 --> 16467.308 | Lr --> 0.000 | Seconds_per_step --> 0.339 | [2024-07-27 12:58:21,362][Main][INFO] - [train] Step 61400 out of 65536 | Loss --> 2.388 | Grad_l2 --> 0.500 | Weights_l2 --> 16467.276 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 12:58:56,212][Main][INFO] - [train] Step 61500 out of 65536 | Loss --> 2.383 | Grad_l2 --> 0.499 | Weights_l2 --> 16467.242 | Lr --> 0.000 | Seconds_per_step --> 0.348 | [2024-07-27 12:59:31,529][Main][INFO] - [train] Step 61600 out of 65536 | Loss --> 2.414 | Grad_l2 --> 0.501 | Weights_l2 --> 16467.211 | Lr --> 0.000 | Seconds_per_step --> 0.353 | [2024-07-27 13:00:06,515][Main][INFO] - [train] Step 61700 out of 65536 | Loss --> 2.389 | Grad_l2 --> 0.510 | Weights_l2 --> 16467.183 | Lr --> 0.000 | Seconds_per_step --> 0.350 | [2024-07-27 13:00:41,406][Main][INFO] - [train] Step 61800 out of 65536 | Loss --> 2.399 | Grad_l2 --> 0.512 | Weights_l2 --> 16467.156 | Lr --> 0.000 | Seconds_per_step --> 0.349 | [2024-07-27 13:01:16,176][Main][INFO] - [train] Step 61900 out of 65536 | Loss --> 2.396 | Grad_l2 --> 0.504 | Weights_l2 --> 16467.127 | Lr --> 0.000 | Seconds_per_step --> 0.348 | [2024-07-27 13:01:50,962][Main][INFO] - [train] Step 62000 out of 65536 | Loss --> 2.393 | Grad_l2 --> 0.509 | Weights_l2 --> 16467.102 | Lr --> 0.000 | Seconds_per_step --> 0.348 | [2024-07-27 13:02:25,145][Main][INFO] - [train] Step 62100 out of 65536 | Loss --> 2.390 | Grad_l2 --> 0.517 | Weights_l2 --> 16467.078 | Lr --> 0.000 | Seconds_per_step --> 0.342 | [2024-07-27 13:02:58,631][Main][INFO] - [train] Step 62200 out of 65536 | Loss --> 2.405 | Grad_l2 --> 0.517 | Weights_l2 --> 16467.055 | Lr --> 0.000 | Seconds_per_step --> 0.335 | [2024-07-27 13:03:31,940][Main][INFO] - [train] Step 62300 out of 65536 | Loss --> 2.409 | Grad_l2 --> 0.506 | Weights_l2 --> 16467.035 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 13:04:06,013][Main][INFO] - [train] Step 62400 out of 65536 | Loss --> 2.401 | Grad_l2 --> 0.521 | Weights_l2 --> 16467.016 | Lr --> 0.000 | Seconds_per_step --> 0.341 | [2024-07-27 13:04:41,127][Main][INFO] - [train] Step 62500 out of 65536 | Loss --> 2.378 | Grad_l2 --> 0.509 | Weights_l2 --> 16466.996 | Lr --> 0.000 | Seconds_per_step --> 0.351 | [2024-07-27 13:04:41,127][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-62500 [2024-07-27 13:04:41,129][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 13:04:43,027][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-62500/model.safetensors [2024-07-27 13:04:45,219][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-62500/optimizer.bin [2024-07-27 13:04:45,220][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-62500/scheduler.bin [2024-07-27 13:04:45,220][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-62500/sampler.bin [2024-07-27 13:04:45,220][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-62500/sampler_1.bin [2024-07-27 13:04:45,220][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-62500/random_states_0.pkl [2024-07-27 13:05:20,280][Main][INFO] - [train] Step 62600 out of 65536 | Loss --> 2.400 | Grad_l2 --> 0.520 | Weights_l2 --> 16466.975 | Lr --> 0.000 | Seconds_per_step --> 0.392 | [2024-07-27 13:05:55,458][Main][INFO] - [train] Step 62700 out of 65536 | Loss --> 2.387 | Grad_l2 --> 0.521 | Weights_l2 --> 16466.958 | Lr --> 0.000 | Seconds_per_step --> 0.352 | [2024-07-27 13:06:30,343][Main][INFO] - [train] Step 62800 out of 65536 | Loss --> 2.395 | Grad_l2 --> 0.522 | Weights_l2 --> 16466.945 | Lr --> 0.000 | Seconds_per_step --> 0.349 | [2024-07-27 13:07:05,233][Main][INFO] - [train] Step 62900 out of 65536 | Loss --> 2.397 | Grad_l2 --> 0.525 | Weights_l2 --> 16466.927 | Lr --> 0.000 | Seconds_per_step --> 0.349 | [2024-07-27 13:07:40,104][Main][INFO] - [train] Step 63000 out of 65536 | Loss --> 2.392 | Grad_l2 --> 0.519 | Weights_l2 --> 16466.914 | Lr --> 0.000 | Seconds_per_step --> 0.349 | [2024-07-27 13:08:14,651][Main][INFO] - [train] Step 63100 out of 65536 | Loss --> 2.382 | Grad_l2 --> 0.521 | Weights_l2 --> 16466.900 | Lr --> 0.000 | Seconds_per_step --> 0.345 | [2024-07-27 13:08:48,003][Main][INFO] - [train] Step 63200 out of 65536 | Loss --> 2.399 | Grad_l2 --> 0.523 | Weights_l2 --> 16466.887 | Lr --> 0.000 | Seconds_per_step --> 0.334 | [2024-07-27 13:09:21,934][Main][INFO] - [train] Step 63300 out of 65536 | Loss --> 2.387 | Grad_l2 --> 0.526 | Weights_l2 --> 16466.877 | Lr --> 0.000 | Seconds_per_step --> 0.339 | [2024-07-27 13:09:55,268][Main][INFO] - [train] Step 63400 out of 65536 | Loss --> 2.394 | Grad_l2 --> 0.524 | Weights_l2 --> 16466.865 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 13:10:28,599][Main][INFO] - [train] Step 63500 out of 65536 | Loss --> 2.392 | Grad_l2 --> 0.524 | Weights_l2 --> 16466.853 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 13:11:01,932][Main][INFO] - [train] Step 63600 out of 65536 | Loss --> 2.388 | Grad_l2 --> 0.522 | Weights_l2 --> 16466.844 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 13:11:35,235][Main][INFO] - [train] Step 63700 out of 65536 | Loss --> 2.400 | Grad_l2 --> 0.523 | Weights_l2 --> 16466.833 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 13:12:08,627][Main][INFO] - [train] Step 63800 out of 65536 | Loss --> 2.384 | Grad_l2 --> 0.533 | Weights_l2 --> 16466.826 | Lr --> 0.000 | Seconds_per_step --> 0.334 | [2024-07-27 13:12:42,021][Main][INFO] - [train] Step 63900 out of 65536 | Loss --> 2.379 | Grad_l2 --> 0.524 | Weights_l2 --> 16466.819 | Lr --> 0.000 | Seconds_per_step --> 0.334 | [2024-07-27 13:13:15,700][Main][INFO] - [train] Step 64000 out of 65536 | Loss --> 2.406 | Grad_l2 --> 0.522 | Weights_l2 --> 16466.812 | Lr --> 0.000 | Seconds_per_step --> 0.337 | [2024-07-27 13:13:48,998][Main][INFO] - [train] Step 64100 out of 65536 | Loss --> 2.400 | Grad_l2 --> 0.527 | Weights_l2 --> 16466.806 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 13:14:22,459][Main][INFO] - [train] Step 64200 out of 65536 | Loss --> 2.391 | Grad_l2 --> 0.529 | Weights_l2 --> 16466.799 | Lr --> 0.000 | Seconds_per_step --> 0.335 | [2024-07-27 13:14:55,774][Main][INFO] - [train] Step 64300 out of 65536 | Loss --> 2.400 | Grad_l2 --> 0.532 | Weights_l2 --> 16466.795 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 13:15:29,069][Main][INFO] - [train] Step 64400 out of 65536 | Loss --> 2.393 | Grad_l2 --> 0.525 | Weights_l2 --> 16466.790 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 13:16:02,575][Main][INFO] - [train] Step 64500 out of 65536 | Loss --> 2.381 | Grad_l2 --> 0.528 | Weights_l2 --> 16466.785 | Lr --> 0.000 | Seconds_per_step --> 0.335 | [2024-07-27 13:16:36,408][Main][INFO] - [train] Step 64600 out of 65536 | Loss --> 2.384 | Grad_l2 --> 0.526 | Weights_l2 --> 16466.782 | Lr --> 0.000 | Seconds_per_step --> 0.338 | [2024-07-27 13:17:09,716][Main][INFO] - [train] Step 64700 out of 65536 | Loss --> 2.407 | Grad_l2 --> 0.531 | Weights_l2 --> 16466.780 | Lr --> 0.000 | Seconds_per_step --> 0.333 | [2024-07-27 13:17:43,588][Main][INFO] - [train] Step 64800 out of 65536 | Loss --> 2.388 | Grad_l2 --> 0.523 | Weights_l2 --> 16466.778 | Lr --> 0.000 | Seconds_per_step --> 0.339 | [2024-07-27 13:18:17,010][Main][INFO] - [train] Step 64900 out of 65536 | Loss --> 2.408 | Grad_l2 --> 0.524 | Weights_l2 --> 16466.775 | Lr --> 0.000 | Seconds_per_step --> 0.334 | [2024-07-27 13:18:50,426][Main][INFO] - [train] Step 65000 out of 65536 | Loss --> 2.408 | Grad_l2 --> 0.525 | Weights_l2 --> 16466.773 | Lr --> 0.000 | Seconds_per_step --> 0.334 | [2024-07-27 13:18:50,426][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-65000 [2024-07-27 13:18:50,428][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 13:18:52,083][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-65000/model.safetensors [2024-07-27 13:18:53,905][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-65000/optimizer.bin [2024-07-27 13:18:53,906][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-65000/scheduler.bin [2024-07-27 13:18:53,906][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-65000/sampler.bin [2024-07-27 13:18:53,906][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-65000/sampler_1.bin [2024-07-27 13:18:53,907][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-65000/random_states_0.pkl [2024-07-27 13:19:27,740][Main][INFO] - [train] Step 65100 out of 65536 | Loss --> 2.401 | Grad_l2 --> 0.531 | Weights_l2 --> 16466.772 | Lr --> 0.000 | Seconds_per_step --> 0.373 | [2024-07-27 13:20:02,233][Main][INFO] - [train] Step 65200 out of 65536 | Loss --> 2.396 | Grad_l2 --> 0.525 | Weights_l2 --> 16466.770 | Lr --> 0.000 | Seconds_per_step --> 0.345 | [2024-07-27 13:20:35,753][Main][INFO] - [train] Step 65300 out of 65536 | Loss --> 2.393 | Grad_l2 --> 0.533 | Weights_l2 --> 16466.770 | Lr --> 0.000 | Seconds_per_step --> 0.335 | [2024-07-27 13:21:09,122][Main][INFO] - [train] Step 65400 out of 65536 | Loss --> 2.409 | Grad_l2 --> 0.526 | Weights_l2 --> 16466.768 | Lr --> 0.000 | Seconds_per_step --> 0.334 | [2024-07-27 13:21:42,516][Main][INFO] - [train] Step 65500 out of 65536 | Loss --> 2.400 | Grad_l2 --> 0.527 | Weights_l2 --> 16466.767 | Lr --> 0.000 | Seconds_per_step --> 0.334 | [2024-07-27 13:21:54,915][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-validation.00006-of-00008.json.gz [2024-07-27 13:21:54,915][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-validation.00000-of-00008.json.gz [2024-07-27 13:21:54,915][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-validation.00004-of-00008.json.gz [2024-07-27 13:21:54,916][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-validation.00003-of-00008.json.gz [2024-07-27 13:21:54,916][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-validation.00005-of-00008.json.gz [2024-07-27 13:21:54,916][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-validation.00002-of-00008.json.gz [2024-07-27 13:21:54,917][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-validation.00001-of-00008.json.gz [2024-07-27 13:21:54,917][datasets_modules.datasets.c4.584d57ebe81c209b6c7f31727066d2c4b4bba37cb7092cdd83083d5ec11207db.c4][INFO] - generating examples from = https://huggingface.co/datasets/allenai/c4/resolve/1ddc917116b730e1859edef32896ec5c16be51d0/en/c4-validation.00007-of-00008.json.gz [2024-07-27 13:23:13,725][Main][INFO] - [eval] Step 65537 out of 65536 | Loss --> 2.393 | Accuracy --> 0.560 | Time --> 78.957 | [2024-07-27 13:23:13,727][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-65537 [2024-07-27 13:23:13,730][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-07-27 13:23:15,259][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-65537/model.safetensors [2024-07-27 13:23:17,064][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-65537/optimizer.bin [2024-07-27 13:23:17,065][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-65537/scheduler.bin [2024-07-27 13:23:17,065][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-65537/sampler.bin [2024-07-27 13:23:17,065][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-65537/sampler_1.bin [2024-07-27 13:23:17,065][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-65537/random_states_0.pkl