nohup: ignoring input

[2023-02-20 17:05:49,355] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-20 17:05:49,405] [INFO] [runner.py:548:main] cmd = /opt/conda/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None tune_gpt.py --deepspeed deepspeed.json --upload-experiment
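Note: the long --world_info value in the launcher command is just a base64-encoded JSON map of hosts to local GPU ranks. Decoding it (a quick sketch) reproduces the WORLD INFO DICT logged below:

    import base64, json

    world_info = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNl19"
    print(json.loads(base64.b64decode(world_info)))
    # {'localhost': [0, 1, 2, 3, 4, 5, 6]} -- one node, seven local GPU ranks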
/opt/conda/lib/python3.8/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index
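Note: this FutureWarning comes from xgboost importing the deprecated pandas Int64Index and is harmless for training. Upgrading xgboost is the real fix; a filter works as a stopgap (sketch; it must run before xgboost is first imported):

    import warnings

    warnings.filterwarnings("ignore", category=FutureWarning, module="xgboost")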
[2023-02-20 17:05:51,897] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.11.4
[2023-02-20 17:05:51,897] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6]}
[2023-02-20 17:05:51,897] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=7, node_rank=0
[2023-02-20 17:05:51,897] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6]})
[2023-02-20 17:05:51,897] [INFO] [launch.py:162:main] dist_world_size=7
[2023-02-20 17:05:51,897] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6
No config specified, defaulting to: apps/all
Found cached dataset apps (/home/.cache/huggingface/datasets/codeparrot___apps/all/0.0.0/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5)
Max length: 2048
PyTorch: setting up devices
[2023-02-20 17:06:11,414] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
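Note: this warning is emitted once per rank and is purely informational. Pinning the integration explicitly silences it; a sketch (the actual TrainingArguments in tune_gpt.py are not shown in this log):

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="./results",
        report_to=["tensorboard"],  # explicit choice; "all" keeps the current v4 default behavior
    )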
GPU memory occupied: 6883 MB.
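Note: one common way to produce this "GPU memory occupied" line is NVML; a sketch of what tune_gpt.py presumably does (the script itself is not shown in the log):

    from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)          # GPU 0
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used // 1024**2} MB.")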
[2023-02-20 17:06:12,424] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2023-02-20 17:06:14,006] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
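Note: the mismatch above is between the system CUDA toolkit (11.6, used to compile DeepSpeed's ops) and the CUDA version PyTorch ships with (11.7). A quick way to inspect the PyTorch side:

    import torch

    print(torch.version.cuda)         # 11.7 in this run
    print(torch.cuda.is_available())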
Using /home/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.825831890106201 seconds
Time to load cpu_adam op: 2.6894984245300293 seconds
Time to load cpu_adam op: 2.815955877304077 seconds
Time to load cpu_adam op: 2.816244125366211 seconds
Time to load cpu_adam op: 2.7123100757598877 seconds
Time to load cpu_adam op: 2.8215184211730957 seconds
Time to load cpu_adam op: 2.789081573486328 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.050000, adam_w=1
[2023-02-20 17:06:19,789] [INFO] [logging.py:75:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-02-20 17:06:19,794] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-02-20 17:06:19,795] [INFO] [utils.py:53:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-02-20 17:06:19,795] [INFO] [logging.py:75:log_dist] [Rank 0] Creating torch.float32 ZeRO stage 2 optimizer
[2023-02-20 17:06:19,795] [INFO] [stage_1_and_2.py:144:__init__] Reduce bucket size 500000000
[2023-02-20 17:06:19,795] [INFO] [stage_1_and_2.py:145:__init__] Allgather bucket size 500000000
[2023-02-20 17:06:19,795] [INFO] [stage_1_and_2.py:146:__init__] CPU Offload: True
[2023-02-20 17:06:19,795] [INFO] [stage_1_and_2.py:147:__init__] Round robin gradient partitioning: False
Using /home/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /home/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.34289026260375977 seconds
Time to load utils op: 0.2027883529663086 seconds
Time to load utils op: 0.2021796703338623 seconds
Time to load utils op: 0.2025763988494873 seconds
Time to load utils op: 0.2033846378326416 seconds
Time to load utils op: 0.2029557228088379 seconds
Time to load utils op: 0.30292582511901855 seconds
Rank: 6 partition count [7] and sizes[(17885514, False)]
Rank: 5 partition count [7] and sizes[(17885514, False)]
Rank: 4 partition count [7] and sizes[(17885514, False)]
Rank: 2 partition count [7] and sizes[(17885514, False)]
Rank: 3 partition count [7] and sizes[(17885514, False)]
Rank: 1 partition count [7] and sizes[(17885514, False)]
Rank: 0 partition count [7] and sizes[(17885514, False)]
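Note: the partition size checks out. The model's 125,198,592 trainable parameters (reported at the start of training below) do not divide evenly by 7, so ZeRO evidently pads the flattened buffer up to 7 x 17,885,514 = 125,198,598 elements, giving each rank one equal 17,885,514-element shard, exactly the size reported above.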
[2023-02-20 17:06:27,470] [INFO] [utils.py:825:see_memory_usage] Before initializing optimizer states
[2023-02-20 17:06:27,471] [INFO] [utils.py:826:see_memory_usage] MA 0.66 GB Max_MA 0.66 GB CA 0.85 GB Max_CA 1 GB
[2023-02-20 17:06:27,471] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory: used = 39.85 GB, percent = 7.9%
Using /home/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00165557861328125 seconds
Time to load utils op: 0.008014678955078125 seconds
Time to load utils op: 0.03653693199157715 seconds
Time to load utils op: 0.008858203887939453 seconds
Time to load utils op: 0.0007452964782714844 seconds
Time to load utils op: 0.046510934829711914 seconds
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
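Note: the pattern this warning recommends is tokenizing and padding in a single tokenizer call; a minimal sketch with illustrative names, not taken from tune_gpt.py:

    from transformers import GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizers have no pad token by default
    texts = ["def add(a, b):\n    return a + b"]
    batch = tokenizer(texts, padding="max_length", truncation=True,
                      max_length=2048, return_tensors="pt")  # matches "Max length: 2048" above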
[2023-02-20 17:06:28,120] [INFO] [utils.py:825:see_memory_usage] After initializing optimizer states
[2023-02-20 17:06:28,121] [INFO] [utils.py:826:see_memory_usage] MA 0.66 GB Max_MA 0.66 GB CA 0.85 GB Max_CA 1 GB
[2023-02-20 17:06:28,121] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory: used = 40.4 GB, percent = 8.0%
[2023-02-20 17:06:28,121] [INFO] [stage_1_and_2.py:527:__init__] optimizer state initialized
[2023-02-20 17:06:28,222] [INFO] [utils.py:825:see_memory_usage] After initializing ZeRO optimizer
[2023-02-20 17:06:28,222] [INFO] [utils.py:826:see_memory_usage] MA 0.66 GB Max_MA 0.66 GB CA 0.85 GB Max_CA 1 GB
[2023-02-20 17:06:28,223] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory: used = 40.4 GB, percent = 8.0%
[2023-02-20 17:06:28,223] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2023-02-20 17:06:28,223] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupLR
[2023-02-20 17:06:28,223] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7efce1a33f40>
[2023-02-20 17:06:28,224] [INFO] [logging.py:75:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[[0.9, 0.999]]
[2023-02-20 17:06:28,226] [INFO] [config.py:1009:print] DeepSpeedEngine configuration:
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] activation_checkpointing_config {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] amp_enabled .................. False
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] amp_params ................... False
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] bfloat16_enabled ............. False
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] checkpoint_parallel_write_pipeline False
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] checkpoint_tag_validation_enabled True
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] checkpoint_tag_validation_fail False
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7efd0eb74250>
[2023-02-20 17:06:28,226] [INFO] [config.py:1013:print] communication_data_type ...... None
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] curriculum_enabled_legacy .... False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] curriculum_params_legacy ..... False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] data_efficiency_enabled ...... False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] dataloader_drop_last ......... False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] disable_allgather ............ False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] dump_state ................... False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] dynamic_loss_scale_args ...... None
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] eigenvalue_enabled ........... False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] eigenvalue_gas_boundary_resolution 1
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] eigenvalue_layer_num ......... 0
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] eigenvalue_max_iter .......... 100
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] eigenvalue_stability ......... 1e-06
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] eigenvalue_tol ............... 0.01
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] eigenvalue_verbose ........... False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] elasticity_enabled ........... False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] flops_profiler_config ........ {
    "enabled": false,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] fp16_auto_cast ............... None
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] fp16_enabled ................. False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] fp16_master_weights_and_gradients False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] global_rank .................. 0
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] grad_accum_dtype ............. None
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] gradient_accumulation_steps .. 64
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] gradient_clipping ............ 1.0
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] gradient_predivide_factor .... 1.0
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] initial_dynamic_scale ........ 65536
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] load_universal_checkpoint .... False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] loss_scale ................... 0
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] memory_breakdown ............. False
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=True, output_path='logs/', job_name='train_neo') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=True
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2023-02-20 17:06:28,227] [INFO] [config.py:1013:print] optimizer_legacy_fusion ...... False
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] optimizer_name ............... adamw
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] optimizer_params ............. {'lr': 5e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.05}
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] pld_enabled .................. False
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] pld_params ................... False
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] prescale_gradients ........... False
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] scheduler_name ............... WarmupLR
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 5e-05, 'warmup_num_steps': 500}
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] sparse_attention ............. None
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] sparse_gradients_enabled ..... False
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] steps_per_print .............. 2000
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] train_batch_size ............. 2688
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] train_micro_batch_size_per_gpu 6
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] use_node_local_storage ....... False
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] wall_clock_breakdown ......... False
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] world_size ................... 7
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] zero_allow_untested_optimizer True
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] zero_enabled ................. True
[2023-02-20 17:06:28,228] [INFO] [config.py:1013:print] zero_optimization_stage ...... 2
[2023-02-20 17:06:28,228] [INFO] [config.py:998:print_user_config] json = {
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 5e-05,
            "betas": [0.9, 0.999],
            "eps": 1e-08,
            "weight_decay": 0.05
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 5e-05,
            "warmup_num_steps": 500
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5.000000e+08,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5.000000e+08,
        "contiguous_gradients": true
    },
    "tensorboard": {
        "enabled": true,
        "output_path": "logs/",
        "job_name": "train_neo"
    },
    "zero_allow_untested_optimizer": true,
    "gradient_accumulation_steps": 64,
    "gradient_clipping": 1.0,
    "steps_per_print": 2.000000e+03,
    "train_batch_size": 2.688000e+03,
    "train_micro_batch_size_per_gpu": 6,
    "wall_clock_breakdown": false
}
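Note: the json block above is the user-supplied configuration, i.e. the contents of the deepspeed.json file passed to tune_gpt.py via --deepspeed in the launcher command at the top of this log.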
Using /home/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00042748451232910156 seconds
***** Running training *****
  Num examples = 117232
  Num Epochs = 10
  Instantaneous batch size per device = 6
  Total train batch size (w. parallel, distributed & accumulation) = 2688
  Gradient Accumulation steps = 64
  Total optimization steps = 430
  Number of trainable parameters = 125198592
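Note: these Trainer numbers are consistent with the DeepSpeed config above; a quick sanity check:

    # effective batch = micro-batch x gradient-accumulation steps x world size
    micro_batch, grad_accum, world_size = 6, 64, 7
    train_batch_size = micro_batch * grad_accum * world_size
    print(train_batch_size)  # 2688, as reported above

    # optimizer steps = floor(examples / effective batch) per epoch, times epochs
    num_examples, num_epochs = 117232, 10
    print(num_examples // train_batch_size * num_epochs)  # 43 * 10 = 430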
0%| | 0/430 [00:00<?, ?it/s]
0%| | 1/430 [01:21<9:41:43, 81.36s/it]
{'loss': 6.8879, 'learning_rate': 0.0, 'epoch': 0.02}
0%| | 2/430 [02:41<9:36:10, 80.77s/it]
1%| | 3/430 [04:02<9:33:33, 80.59s/it]
1%| | 4/430 [05:22<9:31:45, 80.53s/it]
1%| | 5/430 [06:42<9:30:07, 80.49s/it]
{'loss': 4.4038, 'learning_rate': 1.294882868674145e-05, 'epoch': 0.11}
1%|█ | 6/430 [08:03<9:28:34, 80.46s/it]
2%|█ | 7/430 [09:24<9:28:07, 80.58s/it]
2%|█ | 8/430 [10:44<9:26:23, 80.53s/it]
2%|█ | 9/430 [12:05<9:24:58, 80.52s/it]
2%|█ | 10/430 [13:25<9:23:23, 80.49s/it]
{'loss': 1.7286, 'learning_rate': 1.852558565662928e-05, 'epoch': 0.23}
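Note: the learning-rate column follows DeepSpeed's WarmupLR scheduler with its default logarithmic warmup; under warmup_max_lr=5e-05 and warmup_num_steps=500, the value logged at global step N matches lr(N - 1), suggesting the log reports the LR of the previous optimizer step:

    import math

    warmup_max_lr, warmup_num_steps = 5e-05, 500

    def warmup_lr(step):
        # WarmupLR, warmup_type="log": lr = max_lr * ln(step + 1) / ln(warmup_num_steps)
        return warmup_max_lr * math.log(step + 1) / math.log(warmup_num_steps)

    print(warmup_lr(4))  # 1.2948828...e-05, logged at step 5 above
    print(warmup_lr(9))  # 1.8525585...e-05, logged at step 10 above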
3%|█ | 11/430 [14:45<9:22:02, 80.48s/it]
3%|█ | 12/430 [16:06<9:20:48, 80.50s/it]
3%|█ | 13/430 [17:26<9:19:19, 80.48s/it]
3%|█ | 14/430 [18:47<9:18:17, 80.52s/it]
3%|█ | 15/430 [20:08<9:16:48, 80.50s/it]
{'loss': 0.5715, 'learning_rate': 2.1787779359648994e-05, 'epoch': 0.34}
4%|█ | 16/430 [21:28<9:15:34, 80.52s/it]
4%|█ | 17/430 [22:49<9:14:08, 80.51s/it]
4%|█ | 18/430 [24:09<9:12:33, 80.47s/it]
4%|█ | 19/430 [25:30<9:11:22, 80.49s/it]
5%|█ | 20/430 [26:50<9:09:51, 80.47s/it]
{'loss': 0.5316, 'learning_rate': 2.41023426265171e-05, 'epoch': 0.46}
5%|█ | 21/430 [28:10<9:08:25, 80.45s/it]
5%|█ | 22/430 [29:31<9:08:02, 80.59s/it]
5%|█ | 23/430 [30:52<9:06:16, 80.53s/it]
6%|█ | 24/430 [32:12<9:04:45, 80.51s/it]
6%|█ | 25/430 [33:32<9:03:06, 80.46s/it]
{'loss': 0.5132, 'learning_rate': 2.58976573734829e-05, 'epoch': 0.57}
6%|█ | 26/430 [34:53<9:01:27, 80.42s/it]
6%|█ | 27/430 [36:13<8:59:58, 80.39s/it]
7%|█ | 28/430 [37:33<8:58:26, 80.36s/it]
7%|█ | 29/430 [38:54<8:57:43, 80.46s/it]
7%|█ | 30/430 [40:14<8:56:19, 80.45s/it]
{'loss': 0.4914, 'learning_rate': 2.7364536329536817e-05, 'epoch': 0.69}
7%|█ | 31/430 [41:35<8:55:05, 80.46s/it]
7%|█ | 32/430 [42:55<8:53:43, 80.46s/it]
8%|█ | 33/430 [44:16<8:52:17, 80.45s/it]
8%|█ | 34/430 [45:36<8:50:56, 80.45s/it]
8%|█ | 35/430 [46:57<8:49:37, 80.45s/it]
{'loss': 0.4777, 'learning_rate': 2.8604764815275082e-05, 'epoch': 0.8}
8%|█ | 36/430 [48:17<8:48:22, 80.46s/it]
9%|█ | 37/430 [49:38<8:46:59, 80.46s/it]
9%|█ | 38/430 [50:58<8:45:38, 80.46s/it]
9%|█ | 39/430 [52:19<8:44:24, 80.47s/it]
9%|█ | 40/430 [53:39<8:43:03, 80.47s/it]
{'loss': 0.4716, 'learning_rate': 2.9679099596404923e-05, 'epoch': 0.92}
10%|█ | 41/430 [55:00<8:42:48, 80.64s/it]
10%|█ | 42/430 [56:21<8:41:07, 80.59s/it]
10%|█ | 43/430 [57:41<8:39:31, 80.55s/it]
10%|█ | 44/430 [59:50<10:12:21, 95.19s/it]
10%|█ | 45/430 [1:01:11<9:42:25, 90.77s/it]
{'loss': 0.515, 'learning_rate': 3.0626730032556536e-05, 'epoch': 1.05}
11%|█ | 46/430 [1:02:31<9:21:05, 87.67s/it]
11%|█ | 47/430 [1:03:52<9:06:42, 85.65s/it]
11%|█ | 48/430 [1:05:13<8:55:45, 84.15s/it]
11%|██ | 49/430 [1:06:33<8:47:15, 83.03s/it]
12%|██ | 50/430 [1:07:54<8:41:00, 82.27s/it]
{'loss': 0.4496, 'learning_rate': 3.147441434337073e-05, 'epoch': 1.16}
12%|██ | 51/430 [1:09:14<8:36:14, 81.73s/it]
12%|██ | 52/430 [1:10:35<8:32:33, 81.36s/it]
12%|██ | 53/430 [1:11:55<8:30:00, 81.17s/it]
13%|██ | 54/430 [1:13:16<8:27:22, 80.96s/it]
13%|██ | 55/430 [1:14:37<8:25:21, 80.86s/it]
{'loss': 0.4428, 'learning_rate': 3.224123807782732e-05, 'epoch': 1.28}
13%|██ | 56/430 [1:15:57<8:23:29, 80.77s/it]
13%|██ | 57/430 [1:17:18<8:21:34, 80.68s/it]
13%|██ | 58/430 [1:18:38<8:19:56, 80.63s/it]
14%|██ | 59/430 [1:19:59<8:18:24, 80.60s/it]
14%|██ | 60/430 [1:21:19<8:16:44, 80.55s/it]
{'loss': 0.4354, 'learning_rate': 3.294129329942464e-05, 'epoch': 1.39}
14%|██ | 61/430 [1:22:40<8:15:19, 80.54s/it]
14%|██ | 62/430 [1:24:00<8:13:53, 80.53s/it]
15%|██ | 63/430 [1:25:21<8:12:48, 80.57s/it]
15%|██ | 64/430 [1:26:41<8:11:20, 80.55s/it]
15%|██ | 65/430 [1:28:02<8:09:50, 80.52s/it]
{'loss': 0.4267, 'learning_rate': 3.358528167653452e-05, 'epoch': 1.5}
15%|██ | 66/430 [1:29:22<8:08:31, 80.52s/it]
16%|██ | 67/430 [1:30:43<8:07:03, 80.51s/it]
16%|██ | 68/430 [1:32:03<8:05:39, 80.50s/it]
16%|██ | 69/430 [1:33:24<8:04:17, 80.49s/it]
16%|██ | 70/430 [1:34:44<8:02:50, 80.47s/it]
{'loss': 0.4245, 'learning_rate': 3.4181521785162905e-05, 'epoch': 1.62}
17%|██ | 71/430 [1:36:05<8:02:07, 80.58s/it]
17%|██ | 72/430 [1:37:25<8:00:27, 80.52s/it]
17%|██ | 73/430 [1:38:46<7:58:58, 80.50s/it]
17%|██ | 74/430 [1:40:06<7:57:28, 80.47s/it]
17%|██ | 75/430 [1:41:27<7:56:03, 80.46s/it]
{'loss': 0.4183, 'learning_rate': 3.473660804639045e-05, 'epoch': 1.73}
18%|██ | 76/430 [1:42:47<7:54:59, 80.51s/it]
18%|██ | 77/430 [1:44:08<7:53:37, 80.50s/it]
18%|██ | 78/430 [1:45:28<7:52:21, 80.52s/it]
18%|██ | 79/430 [1:46:49<7:50:59, 80.51s/it]
19%|██ | 80/430 [1:48:09<7:49:36, 80.51s/it]
{'loss': 0.4186, 'learning_rate': 3.525585656629274e-05, 'epoch': 1.85}
19%|██ | 81/430 [1:49:30<7:48:23, 80.53s/it]
19%|██ | 82/430 [1:50:50<7:47:05, 80.53s/it]
19%|██ | 83/430 [1:52:11<7:45:34, 80.50s/it]
20%|██ | 84/430 [1:53:32<7:45:20, 80.69s/it]
20%|██ | 85/430 [1:54:52<7:43:34, 80.62s/it]
{'loss': 0.4107, 'learning_rate': 3.574361557584177e-05, 'epoch': 1.96}
20%|██ | 86/430 [1:56:13<7:42:01, 80.59s/it]
20%|██ | 87/430 [1:58:23<9:05:42, 95.46s/it]
20%|██ | 88/430 [1:59:44<8:38:35, 90.98s/it]
21%|██ | 89/430 [2:01:04<8:19:08, 87.82s/it]
21%|██ | 90/430 [2:02:25<8:05:18, 85.64s/it]
{'loss': 0.4564, 'learning_rate': 3.6292389118326696e-05, 'epoch': 2.09}
21%|██ | 91/430 [2:03:45<7:55:07, 84.09s/it]
21%|███ | 92/430 [2:05:06<7:47:38, 83.01s/it]
22%|███ | 93/430 [2:06:26<7:42:05, 82.27s/it]
22%|███ | 94/430 [2:07:47<7:37:48, 81.75s/it]
22%|███ | 95/430 [2:09:07<7:34:41, 81.44s/it]
{'loss': 0.3989, 'learning_rate': 3.6722735522346654e-05, 'epoch': 2.21}
22%|███ | 96/430 [2:10:28<7:31:43, 81.15s/it]
23%|███ | 97/430 [2:11:48<7:29:16, 80.95s/it]
23%|███ | 98/430 [2:13:09<7:27:04, 80.80s/it]
23%|███ | 99/430 [2:14:29<7:25:22, 80.73s/it]
23%|███ | 100/430 [2:15:50<7:23:40, 80.67s/it]
{'loss': 0.3972, 'learning_rate': 3.713122729342321e-05, 'epoch': 2.32}
23%|███ | 101/430 [2:17:10<7:22:04, 80.62s/it]
24%|███ | 102/430 [2:18:31<7:20:28, 80.58s/it]
24%|███ | 103/430 [2:19:51<7:18:57, 80.54s/it]
24%|███ | 104/430 [2:21:12<7:17:41, 80.56s/it]
24%|███ | 105/430 [2:22:32<7:16:17, 80.55s/it]
{'loss': 0.3949, 'learning_rate': 3.751997728783617e-05, 'epoch': 2.44}
25%|███ | 106/430 [2:23:53<7:15:01, 80.56s/it]
25%|███ | 107/430 [2:25:14<7:13:51, 80.59s/it]
25%|███ | 108/430 [2:26:34<7:12:27, 80.58s/it]
25%|███ | 109/430 [2:27:55<7:11:11, 80.60s/it]
26%|███ | 110/430 [2:29:15<7:09:47, 80.59s/it]
{'loss': 0.3924, 'learning_rate': 3.789080603898437e-05, 'epoch': 2.55}
26%|███ | 111/430 [2:30:36<7:08:21, 80.57s/it]
26%|███ | 112/430 [2:31:57<7:07:20, 80.63s/it]
26%|███ | 113/430 [2:33:17<7:05:49, 80.60s/it]
27%|███ | 114/430 [2:34:38<7:04:22, 80.58s/it]
27%|███ | 115/430 [2:35:58<7:02:59, 80.57s/it]
{'loss': 0.3919, 'learning_rate': 3.8245293313935915e-05, 'epoch': 2.66}
27%|███ | 116/430 [2:37:20<7:02:34, 80.75s/it]
27%|███ | 117/430 [2:38:40<7:00:55, 80.69s/it]
27%|███ | 118/430 [2:40:01<6:59:23, 80.65s/it]
28%|███ | 119/430 [2:41:21<6:57:54, 80.63s/it]
28%|███ | 120/430 [2:42:42<6:56:26, 80.60s/it]
{'loss': 0.3858, 'learning_rate': 3.8584818782171724e-05, 'epoch': 2.78}
28%|███ | 121/430 [2:44:02<6:55:00, 80.58s/it]
28%|███ | 122/430 [2:45:23<6:53:30, 80.55s/it]
29%|███ | 123/430 [2:46:43<6:52:17, 80.58s/it]
29%|███ | 124/430 [2:48:04<6:51:04, 80.60s/it]
29%|███ | 125/430 [2:49:25<6:49:38, 80.59s/it]
{'loss': 0.3816, 'learning_rate': 3.8910594444236536e-05, 'epoch': 2.89}
29%|███ | 126/430 [2:50:45<6:48:24, 80.61s/it]
30%|███ | 127/430 [2:52:06<6:47:00, 80.60s/it]
30%|███ | 128/430 [2:53:27<6:46:32, 80.77s/it]
30%|███ | 129/430 [2:54:48<6:44:56, 80.72s/it]
30%|███ | 130/430 [2:56:57<7:56:43, 95.35s/it]
{'loss': 0.4213, 'learning_rate': 3.922369074599331e-05, 'epoch': 3.02}
30%|███ | 131/430 [2:58:18<7:32:58, 90.90s/it]
31%|███ | 132/430 [2:59:38<7:15:53, 87.76s/it]
31%|███ | 133/430 [3:00:59<7:03:46, 85.61s/it]
31%|███ | 134/430 [3:02:19<6:54:52, 84.10s/it]
31%|████ | 135/430 [3:03:40<6:48:13, 83.03s/it]
{'loss': 0.3764, 'learning_rate': 3.9525057798763787e-05, 'epoch': 3.14}
32%|████ | 136/430 [3:05:00<6:43:08, 82.27s/it]
32%|████ | 137/430 [3:06:21<6:39:15, 81.76s/it]
32%|████ | 138/430 [3:07:41<6:36:06, 81.39s/it]
32%|████ | 139/430 [3:09:02<6:33:50, 81.20s/it]
33%|████ | 140/430 [3:10:23<6:31:28, 81.00s/it]
{'loss': 0.3702, 'learning_rate': 3.981554276636201e-05, 'epoch': 3.25}
33%|████ | 141/430 [3:11:43<6:29:27, 80.86s/it]
33%|████ | 142/430 [3:13:04<6:27:32, 80.74s/it]
33%|████ | 143/430 [3:14:24<6:26:05, 80.72s/it]
33%|████ | 144/430 [3:15:45<6:24:25, 80.65s/it]
34%|████ | 145/430 [3:17:05<6:22:43, 80.57s/it]
{'loss': 0.3689, 'learning_rate': 4.0095904221004775e-05, 'epoch': 3.37}
34%|████ | 146/430 [3:18:26<6:21:11, 80.53s/it]
34%|████ | 147/430 [3:19:46<6:19:49, 80.53s/it]
34%|████ | 148/430 [3:21:07<6:18:31, 80.54s/it]
35%|████ | 149/430 [3:22:27<6:17:16, 80.56s/it]
35%|████ | 150/430 [3:23:48<6:15:53, 80.55s/it]
{'loss': 0.3674, 'learning_rate': 4.0366824080900185e-05, 'epoch': 3.48}
Saving model checkpoint to ./results/checkpoint-150
Configuration saved in ./results/checkpoint-150/config.json
Model weights saved in ./results/checkpoint-150/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-150/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-150/special_tokens_map.json
[2023-02-20 20:30:17,356] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint global_step151 is begin to save!
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1428: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-02-20 20:30:17,359] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: ./results/checkpoint-150/global_step151/mp_rank_00_model_states.pt
[2023-02-20 20:30:17,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./results/checkpoint-150/global_step151/mp_rank_00_model_states.pt...
[2023-02-20 20:30:18,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./results/checkpoint-150/global_step151/mp_rank_00_model_states.pt.
[2023-02-20 20:30:18,019] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./results/checkpoint-150/global_step151/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-02-20 20:30:18,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./results/checkpoint-150/global_step151/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-02-20 20:30:18,223] [INFO] [engine.py:3407:_save_zero_checkpoint] zero checkpoint saved ./results/checkpoint-150/global_step151/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-02-20 20:30:18,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step151 is ready now!
Deleting older checkpoint [results/checkpoint-45] due to args.save_total_limit
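Note: this checkpoint rotation is the HF Trainer's save_total_limit at work, which prunes the oldest checkpoints in output_dir once the cap is exceeded. Illustrative arguments only; the actual values used by tune_gpt.py are not shown in the log:

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="./results",
        save_total_limit=2,  # hypothetical value; keeps only the newest N checkpoints
    )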
35%|████ | 151/430 [3:25:10<6:16:59, 81.07s/it]
35%|████ | 152/430 [3:26:31<6:14:42, 80.87s/it]
36%|████ | 153/430 [3:27:51<6:12:50, 80.76s/it]
36%|████ | 154/430 [3:29:11<6:11:03, 80.67s/it]
36%|████ | 155/430 [3:30:32<6:09:25, 80.60s/it]
{'loss': 0.3634, 'learning_rate': 4.062891760247626e-05, 'epoch': 3.6}
36%|████ | 156/430 [3:31:52<6:07:55, 80.57s/it]
37%|████ | 157/430 [3:33:13<6:07:08, 80.69s/it]
37%|████ | 158/430 [3:34:34<6:05:33, 80.64s/it]
37%|████ | 159/430 [3:35:54<6:04:05, 80.61s/it]
37%|████ | 160/430 [3:37:15<6:02:37, 80.58s/it]
{'loss': 0.3622, 'learning_rate': 4.0882741795693975e-05, 'epoch': 3.71}
37%|████ | 161/430 [3:38:36<6:01:14, 80.58s/it]
38%|████ | 162/430 [3:39:56<6:00:01, 80.60s/it]
38%|████ | 163/430 [3:41:17<5:59:32, 80.80s/it]
38%|████ | 164/430 [3:42:38<5:58:27, 80.86s/it]
38%|████ | 165/430 [3:43:59<5:56:38, 80.75s/it]
{'loss': 0.3545, 'learning_rate': 4.1128802551961496e-05, 'epoch': 3.83}
39%|████ | 166/430 [3:45:20<5:55:05, 80.70s/it]
39%|████ | 167/430 [3:46:40<5:53:32, 80.66s/it]
39%|████ | 168/430 [3:48:01<5:52:04, 80.63s/it]
39%|████ | 169/430 [3:49:21<5:50:54, 80.67s/it]
40%|████ | 170/430 [3:50:42<5:49:25, 80.64s/it]
{'loss': 0.355, 'learning_rate': 4.136756071398985e-05, 'epoch': 3.94}
40%|████ | 171/430 [3:52:03<5:48:04, 80.64s/it]
40%|████ | 172/430 [3:53:23<5:46:33, 80.59s/it]
40%|████ | 173/430 [3:55:33<6:48:29, 95.37s/it]
40%|████ | 174/430 [3:56:53<6:27:52, 90.91s/it]
41%|████ | 175/430 [3:58:14<6:13:17, 87.83s/it]
{'loss': 0.3978, 'learning_rate': 4.164502129979834e-05, 'epoch': 4.07}
41%|████ | 176/430 [3:59:35<6:02:43, 85.68s/it]
41%|████ | 177/430 [4:00:55<5:54:43, 84.12s/it]
41%|█████ | 178/430 [4:02:16<5:48:45, 83.04s/it]
42%|█████ | 179/430 [4:03:36<5:44:09, 82.27s/it]
42%|█████ | 180/430 [4:04:57<5:40:41, 81.77s/it]
{'loss': 0.3492, 'learning_rate': 4.186914608821452e-05, 'epoch': 4.18}
42%|█████ | 181/430 [4:06:17<5:37:50, 81.41s/it]
42%|█████ | 182/430 [4:07:38<5:35:22, 81.14s/it]
43%|█████ | 183/430 [4:08:58<5:33:16, 80.96s/it]
43%|█████ | 184/430 [4:10:19<5:31:25, 80.83s/it]
43%|█████ | 185/430 [4:11:40<5:29:40, 80.74s/it]
{'loss': 0.3483, 'learning_rate': 4.208719628018618e-05, 'epoch': 4.3}
43%|█████ | 186/430 [4:13:00<5:28:03, 80.67s/it]
43%|█████ | 187/430 [4:14:21<5:26:44, 80.68s/it]
44%|█████ | 188/430 [4:15:41<5:25:24, 80.68s/it]
44%|█████ | 189/430 [4:17:02<5:23:51, 80.63s/it]
44%|█████ | 190/430 [4:18:22<5:22:20, 80.58s/it]
{'loss': 0.3443, 'learning_rate': 4.229949249223448e-05, 'epoch': 4.41}
44%|█████ | 191/430 [4:19:43<5:20:53, 80.56s/it]
45%|█████ | 192/430 [4:21:03<5:19:33, 80.56s/it]
45%|█████ | 193/430 [4:22:24<5:18:45, 80.70s/it]
45%|█████ | 194/430 [4:23:45<5:17:14, 80.65s/it]
45%|█████ | 195/430 [4:25:06<5:15:47, 80.63s/it]
{'loss': 0.3448, 'learning_rate': 4.250633060899951e-05, 'epoch': 4.53}
46%|█████ | 196/430 [4:26:26<5:14:14, 80.57s/it]
46%|█████ | 197/430 [4:27:47<5:12:46, 80.54s/it]
46%|█████ | 198/430 [4:29:07<5:11:27, 80.55s/it]
46%|█████ | 199/430 [4:30:28<5:10:06, 80.55s/it]
47%|█████ | 200/430 [4:31:48<5:08:45, 80.55s/it]
{'loss': 0.3371, 'learning_rate': 4.2707984263311035e-05, 'epoch': 4.64}
47%|█████ | 201/430 [4:33:09<5:07:25, 80.55s/it]
47%|█████ | 202/430 [4:34:29<5:06:03, 80.54s/it]
47%|█████ | 203/430 [4:35:50<5:04:47, 80.56s/it]
47%|█████ | 204/430 [4:37:10<5:03:25, 80.55s/it]
48%|█████ | 205/430 [4:38:31<5:02:05, 80.56s/it]
{'loss': 0.3362, 'learning_rate': 4.290470701297542e-05, 'epoch': 4.76}
48%|█████ | 206/430 [4:39:52<5:00:48, 80.57s/it]
48%|█████ | 207/430 [4:41:12<4:59:23, 80.55s/it]
48%|█████ | 208/430 [4:42:33<4:58:04, 80.56s/it]
49%|█████ | 209/430 [4:43:53<4:56:40, 80.54s/it]
49%|█████ | 210/430 [4:45:14<4:55:18, 80.54s/it]
{'loss': 0.3333, 'learning_rate': 4.3096734257723994e-05, 'epoch': 4.87}
49%|█████ | 211/430 [4:46:35<4:54:27, 80.67s/it]
49%|█████ | 212/430 [4:47:56<4:53:37, 80.81s/it]
50%|█████ | 213/430 [4:49:16<4:52:03, 80.75s/it]
50%|█████ | 214/430 [4:50:37<4:50:43, 80.75s/it]
50%|█████ | 215/430 [4:51:58<4:49:04, 80.67s/it]
{'loss': 0.3304, 'learning_rate': 4.3284284932676175e-05, 'epoch': 4.99}
50%|█████ | 216/430 [4:54:08<5:40:27, 95.45s/it]
50%|█████ | 217/430 [4:55:28<5:22:58, 90.98s/it]
51%|█████ | 218/430 [4:56:49<5:10:22, 87.84s/it]
51%|█████ | 219/430 [4:58:09<5:01:10, 85.64s/it]
51%|█████ | 220/430 [4:59:30<4:54:31, 84.15s/it]
{'loss': 0.3689, 'learning_rate': 4.350372288827778e-05, 'epoch': 5.11}
51%|██████ | 221/430 [5:00:50<4:49:24, 83.09s/it]
52%|██████ | 222/430 [5:02:11<4:45:37, 82.39s/it]
52%|██████ | 223/430 [5:03:32<4:42:20, 81.84s/it]
52%|██████ | 224/430 [5:04:52<4:39:35, 81.44s/it]
52%|██████ | 225/430 [5:06:13<4:37:20, 81.17s/it]
{'loss': 0.3281, 'learning_rate': 4.3682123980857955e-05, 'epoch': 5.23}
53%|██████ | 226/430 [5:07:33<4:35:24, 81.00s/it]
53%|██████ | 227/430 [5:08:54<4:33:33, 80.85s/it]
53%|██████ | 228/430 [5:10:14<4:31:47, 80.73s/it]
53%|██████ | 229/430 [5:11:35<4:30:14, 80.67s/it]
53%|██████ | 230/430 [5:12:55<4:28:43, 80.62s/it]
{'loss': 0.321, 'learning_rate': 4.3856654894696003e-05, 'epoch': 5.34}
54%|██████ | 231/430 [5:14:16<4:27:16, 80.58s/it]
54%|██████ | 232/430 [5:15:36<4:25:49, 80.55s/it]
54%|██████ | 233/430 [5:16:57<4:24:23, 80.53s/it]
54%|██████ | 234/430 [5:18:17<4:22:58, 80.50s/it]
55%|██████ | 235/430 [5:19:38<4:21:32, 80.48s/it]
{'loss': 0.3211, 'learning_rate': 4.402747998752177e-05, 'epoch': 5.46}
55%|██████ | 236/430 [5:20:58<4:20:16, 80.50s/it]
55%|██████ | 237/430 [5:22:19<4:18:53, 80.49s/it]
55%|██████ | 238/430 [5:23:39<4:17:33, 80.49s/it]
56%|██████ | 239/430 [5:25:00<4:16:10, 80.47s/it]
56%|██████ | 240/430 [5:26:20<4:14:47, 80.46s/it]
{'loss': 0.3177, 'learning_rate': 4.41947533645377e-05, 'epoch': 5.57}
56%|██████ | 241/430 [5:27:40<4:13:24, 80.45s/it]
56%|██████ | 242/430 [5:29:01<4:12:06, 80.46s/it]
57%|██████ | 243/430 [5:30:21<4:10:48, 80.47s/it]
57%|██████ | 244/430 [5:31:42<4:09:29, 80.48s/it]
57%|██████ | 245/430 [5:33:02<4:08:08, 80.48s/it]
{'loss': 0.3178, 'learning_rate': 4.435861971380601e-05, 'epoch': 5.69}
57%|██████ | 246/430 [5:34:23<4:06:46, 80.47s/it]
57%|██████ | 247/430 [5:35:44<4:05:59, 80.65s/it]
58%|██████ | 248/430 [5:37:04<4:04:28, 80.59s/it]
58%|██████ | 249/430 [5:38:25<4:03:06, 80.59s/it]
58%|██████ | 250/430 [5:39:46<4:01:45, 80.59s/it]
{'loss': 0.3147, 'learning_rate': 4.451921505824621e-05, 'epoch': 5.8}
58%|██████ | 251/430 [5:41:06<4:00:32, 80.63s/it]
59%|██████ | 252/430 [5:42:27<3:59:37, 80.77s/it]
59%|██████ | 253/430 [5:43:48<3:58:03, 80.70s/it]
59%|██████ | 254/430 [5:45:08<3:56:32, 80.64s/it]
59%|██████ | 255/430 [5:46:29<3:55:05, 80.60s/it]
{'loss': 0.3131, 'learning_rate': 4.467666743403693e-05, 'epoch': 5.92}
60%|██████ | 256/430 [5:47:50<3:54:05, 80.72s/it]
60%|██████ | 257/430 [5:49:10<3:52:33, 80.65s/it]
60%|██████ | 258/430 [5:50:31<3:51:09, 80.64s/it]
60%|██████ | 259/430 [5:52:40<4:31:29, 95.26s/it]
60%|██████ | 260/430 [5:54:01<4:17:26, 90.86s/it]
{'loss': 0.3454, 'learning_rate': 4.483109750389942e-05, 'epoch': 6.05}
61%|██████ | 261/430 [5:55:22<4:07:10, 87.75s/it]
61%|██████ | 262/430 [5:56:42<3:59:47, 85.64s/it]
61%|██████ | 263/430 [5:58:03<3:54:02, 84.09s/it]
61%|███████ | 264/430 [5:59:23<3:49:37, 83.00s/it]
62%|███████ | 265/430 [6:00:44<3:46:08, 82.23s/it]
{'loss': 0.3044, 'learning_rate': 4.498261911262221e-05, 'epoch': 6.16}
62%|███████ | 266/430 [6:02:04<3:43:19, 81.70s/it]
62%|███████ | 267/430 [6:03:25<3:40:59, 81.34s/it]
62%|███████ | 268/430 [6:04:45<3:39:08, 81.16s/it]
63%|███████ | 269/430 [6:06:06<3:37:16, 80.97s/it]
63%|███████ | 270/430 [6:07:26<3:35:33, 80.84s/it]
{'loss': 0.3066, 'learning_rate': 4.513133979123424e-05, 'epoch': 6.28}
63%|███████ | 271/430 [6:08:47<3:34:06, 80.80s/it]
63%|███████ | 272/430 [6:10:08<3:32:33, 80.72s/it]
63%|███████ | 273/430 [6:11:28<3:31:01, 80.65s/it]
64%|███████ | 274/430 [6:12:49<3:29:32, 80.60s/it]
64%|███████ | 275/430 [6:14:09<3:28:03, 80.54s/it]
{'loss': 0.301, 'learning_rate': 4.527736121541934e-05, 'epoch': 6.39}
64%|███████ | 276/430 [6:15:29<3:26:41, 80.53s/it]
64%|███████ | 277/430 [6:16:50<3:25:18, 80.51s/it]
65%|███████ | 278/430 [6:18:10<3:23:55, 80.50s/it]
65%|███████ | 279/430 [6:19:31<3:22:38, 80.52s/it]
65%|███████ | 280/430 [6:20:52<3:21:30, 80.60s/it]
{'loss': 0.2992, 'learning_rate': 4.5420779623067014e-05, 'epoch': 6.5}
65%|███████ | 281/430 [6:22:12<3:20:06, 80.58s/it]
66%|███████ | 282/430 [6:23:33<3:18:45, 80.58s/it]
66%|███████ | 283/430 [6:24:54<3:17:25, 80.58s/it]
66%|███████ | 284/430 [6:26:14<3:16:05, 80.59s/it]
66%|███████ | 285/430 [6:27:35<3:14:40, 80.56s/it]
{'loss': 0.2963, 'learning_rate': 4.55616861952542e-05, 'epoch': 6.62}
67%|███████ | 286/430 [6:28:55<3:13:19, 80.55s/it]
67%|███████ | 287/430 [6:30:16<3:11:55, 80.53s/it]
67%|███████ | 288/430 [6:31:37<3:10:52, 80.65s/it]
67%|███████ | 289/430 [6:32:57<3:09:26, 80.62s/it]
67%|███████ | 290/430 [6:34:18<3:07:59, 80.57s/it]
{'loss': 0.2912, 'learning_rate': 4.5700167404435284e-05, 'epoch': 6.73}
68%|███████ | 291/430 [6:35:39<3:06:56, 80.69s/it]
68%|███████ | 292/430 [6:36:59<3:05:26, 80.63s/it]
68%|███████ | 293/430 [6:38:20<3:04:03, 80.61s/it]
68%|███████ | 294/430 [6:39:40<3:02:37, 80.57s/it]
69%|███████ | 295/430 [6:41:01<3:01:16, 80.56s/it]
{'loss': 0.2916, 'learning_rate': 4.583630533316995e-05, 'epoch': 6.85}
69%|███████ | 296/430 [6:42:21<2:59:53, 80.55s/it]
69%|███████ | 297/430 [6:43:42<2:58:55, 80.72s/it]
69%|███████ | 298/430 [6:45:03<2:57:26, 80.66s/it]
70%|███████ | 299/430 [6:46:23<2:55:57, 80.59s/it]
70%|███████ | 300/430 [6:47:44<2:54:31, 80.55s/it]
{'loss': 0.2918, 'learning_rate': 4.597017796633075e-05, 'epoch': 6.96}
Saving model checkpoint to ./results/checkpoint-300
Configuration saved in ./results/checkpoint-300/config.json
Model weights saved in ./results/checkpoint-300/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-300/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-300/special_tokens_map.json
[2023-02-20 23:54:13,030] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint global_step303 is begin to save!
[2023-02-20 23:54:13,032] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: ./results/checkpoint-300/global_step303/mp_rank_00_model_states.pt
[2023-02-20 23:54:13,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./results/checkpoint-300/global_step303/mp_rank_00_model_states.pt...
[2023-02-20 23:54:13,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./results/checkpoint-300/global_step303/mp_rank_00_model_states.pt.
[2023-02-20 23:54:13,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./results/checkpoint-300/global_step303/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-02-20 23:54:13,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./results/checkpoint-300/global_step303/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-02-20 23:54:13,740] [INFO] [engine.py:3407:_save_zero_checkpoint] zero checkpoint saved ./results/checkpoint-300/global_step303/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-02-20 23:54:13,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step303 is ready now!
Deleting older checkpoint [results/checkpoint-150] due to args.save_total_limit
|
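The rotation above (checkpoint-300 written, checkpoint-150 deleted) is driven by the Trainer's save settings. A minimal sketch of TrainingArguments that would reproduce this behavior; save_steps, save_total_limit, and logging_steps are inferred from the step numbers in this log, not read from tune_gpt.py:

    from transformers import TrainingArguments

    # Sketch only: values inferred from the log above, not from tune_gpt.py.
    training_args = TrainingArguments(
        output_dir="./results",      # matches the checkpoint paths in the log
        logging_steps=5,             # loss is logged every 5 steps
        save_steps=150,              # checkpoints appear at steps 150, 300, ...
        save_total_limit=1,          # older checkpoints are deleted, as logged
        deepspeed="deepspeed.json",  # per the launcher command
    )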
70%|βββββββ | 301/430 [6:49:06<2:54:08, 80.99s/it]
70%|βββββββ | 302/430 [6:51:15<3:24:02, 95.64s/it]
70%|βββββββ | 303/430 [6:52:36<3:12:48, 91.09s/it]
71%|βββββββ | 304/430 [6:53:56<3:04:36, 87.91s/it]
71%|βββββββ | 305/430 [6:55:17<2:58:30, 85.68s/it]
{'loss': 0.3224, 'learning_rate': 4.612793909203516e-05, 'epoch': 7.09} |
|
71%|βββββββ | 306/430 [6:56:38<2:54:00, 84.19s/it]
71%|ββββββββ | 307/430 [6:57:59<2:50:44, 83.29s/it]
72%|ββββββββ | 308/430 [6:59:19<2:47:40, 82.46s/it]
72%|ββββββββ | 309/430 [7:00:40<2:45:10, 81.91s/it]
72%|ββββββββ | 310/430 [7:02:00<2:42:57, 81.48s/it]
{'loss': 0.2838, 'learning_rate': 4.6257084073957534e-05, 'epoch': 7.21} |
|
72%|ββββββββ | 311/430 [7:03:21<2:41:01, 81.18s/it]
73%|ββββββββ | 312/430 [7:04:41<2:39:14, 80.97s/it]
73%|ββββββββ | 313/430 [7:06:02<2:37:35, 80.82s/it]
73%|ββββββββ | 314/430 [7:07:22<2:36:02, 80.71s/it]
73%|ββββββββ | 315/430 [7:08:43<2:34:33, 80.64s/it]
{'loss': 0.2803, 'learning_rate': 4.6384188765246125e-05, 'epoch': 7.32} |
|
73%|ββββββββ | 316/430 [7:10:03<2:33:10, 80.62s/it]
74%|ββββββββ | 317/430 [7:11:24<2:31:46, 80.59s/it]
74%|ββββββββ | 318/430 [7:12:44<2:30:23, 80.56s/it]
74%|ββββββββ | 319/430 [7:14:05<2:29:01, 80.56s/it]
74%|ββββββββ | 320/430 [7:15:26<2:27:42, 80.57s/it]
{'loss': 0.2839, 'learning_rate': 4.6509316631405805e-05, 'epoch': 7.44} |
|
75%|ββββββββ | 321/430 [7:16:46<2:26:19, 80.54s/it]
75%|ββββββββ | 322/430 [7:18:07<2:24:58, 80.54s/it]
75%|ββββββββ | 323/430 [7:19:27<2:23:39, 80.55s/it]
75%|ββββββββ | 324/430 [7:20:48<2:22:15, 80.53s/it]
76%|ββββββββ | 325/430 [7:22:08<2:20:55, 80.53s/it]
{'loss': 0.2776, 'learning_rate': 4.663252822198809e-05, 'epoch': 7.55} |
|
76%|ββββββββ | 326/430 [7:23:29<2:19:34, 80.52s/it]
76%|ββββββββ | 327/430 [7:24:49<2:18:11, 80.50s/it]
76%|ββββββββ | 328/430 [7:26:10<2:16:52, 80.52s/it]
77%|ββββββββ | 329/430 [7:27:30<2:15:32, 80.52s/it]
77%|ββββββββ | 330/430 [7:28:51<2:14:13, 80.53s/it]
{'loss': 0.2763, 'learning_rate': 4.675388134653313e-05, 'epoch': 7.66} |
|
77%|ββββββββ | 331/430 [7:30:11<2:12:57, 80.58s/it]
77%|ββββββββ | 332/430 [7:31:32<2:11:44, 80.66s/it]
77%|ββββββββ | 333/430 [7:32:53<2:10:32, 80.75s/it]
78%|ββββββββ | 334/430 [7:34:14<2:09:04, 80.67s/it]
78%|ββββββββ | 335/430 [7:35:34<2:07:39, 80.62s/it]
{'loss': 0.2731, 'learning_rate': 4.687343123743873e-05, 'epoch': 7.78} |
|
78%|ββββββββ | 336/430 [7:36:55<2:06:30, 80.75s/it]
78%|ββββββββ | 337/430 [7:38:16<2:05:08, 80.73s/it]
79%|ββββββββ | 338/430 [7:39:36<2:03:41, 80.66s/it]
79%|ββββββββ | 339/430 [7:40:57<2:02:15, 80.61s/it]
79%|ββββββββ | 340/430 [7:42:17<2:00:50, 80.56s/it]
{'loss': 0.271, 'learning_rate': 4.699123070090503e-05, 'epoch': 7.89} |
|
79%|ββββββββ | 341/430 [7:43:38<1:59:27, 80.53s/it]
80%|ββββββββ | 342/430 [7:44:58<1:58:06, 80.53s/it]
80%|ββββββββ | 343/430 [7:46:19<1:56:47, 80.55s/it]
80%|ββββββββ | 344/430 [7:47:40<1:55:30, 80.59s/it]
80%|ββββββββ | 345/430 [7:49:49<2:14:57, 95.26s/it]
{'loss': 0.3009, 'learning_rate': 4.71073302569872e-05, 'epoch': 8.02} |
|
80%|ββββββββ | 346/430 [7:51:10<2:07:10, 90.84s/it]
81%|ββββββββ | 347/430 [7:52:30<2:01:21, 87.73s/it]
81%|ββββββββ | 348/430 [7:53:51<1:56:55, 85.56s/it]
81%|ββββββββ | 349/430 [7:55:11<1:53:26, 84.03s/it]
81%|βββββββββ | 350/430 [7:56:32<1:50:39, 83.00s/it]
{'loss': 0.265, 'learning_rate': 4.7221778269686165e-05, 'epoch': 8.14} |
|
82%|βββββββββ | 351/430 [7:57:52<1:48:17, 82.24s/it]
82%|βββββββββ | 352/430 [7:59:13<1:46:14, 81.72s/it]
82%|βββββββββ | 353/430 [8:00:34<1:44:40, 81.56s/it]
82%|βββββββββ | 354/430 [8:01:54<1:42:53, 81.22s/it]
83%|βββββββββ | 355/430 [8:03:15<1:41:14, 80.99s/it]
{'loss': 0.2651, 'learning_rate': 4.73346210679156e-05, 'epoch': 8.25} |
|
83%|βββββββββ | 356/430 [8:04:36<1:39:57, 81.05s/it]
83%|βββββββββ | 357/430 [8:05:56<1:38:25, 80.90s/it]
83%|βββββββββ | 358/430 [8:07:17<1:36:55, 80.77s/it]
83%|βββββββββ | 359/430 [8:08:38<1:35:30, 80.71s/it]
84%|βββββββββ | 360/430 [8:09:58<1:34:03, 80.63s/it]
{'loss': 0.2607, 'learning_rate': 4.744590305810234e-05, 'epoch': 8.37} |
|
84%|βββββββββ | 361/430 [8:11:18<1:32:39, 80.57s/it]
84%|βββββββββ | 362/430 [8:12:39<1:31:17, 80.55s/it]
84%|βββββββββ | 363/430 [8:13:59<1:29:56, 80.54s/it]
85%|βββββββββ | 364/430 [8:15:20<1:28:34, 80.52s/it]
85%|βββββββββ | 365/430 [8:16:40<1:27:12, 80.50s/it]
{'loss': 0.2599, 'learning_rate': 4.7555666829105464e-05, 'epoch': 8.48} |
|
85%|βββββββββ | 366/430 [8:18:01<1:25:49, 80.47s/it]
85%|βββββββββ | 367/430 [8:19:21<1:24:27, 80.44s/it]
86%|βββββββββ | 368/430 [8:20:42<1:23:10, 80.49s/it]
86%|βββββββββ | 369/430 [8:22:02<1:21:48, 80.47s/it]
86%|βββββββββ | 370/430 [8:23:23<1:20:29, 80.49s/it]
{'loss': 0.2564, 'learning_rate': 4.7663953250074004e-05, 'epoch': 8.6} |
|
86%|βββββββββ | 371/430 [8:24:43<1:19:09, 80.51s/it]
87%|βββββββββ | 372/430 [8:26:04<1:17:55, 80.61s/it]
87%|βββββββββ | 373/430 [8:27:25<1:16:32, 80.57s/it]
87%|βββββββββ | 374/430 [8:28:45<1:15:10, 80.55s/it]
87%|βββββββββ | 375/430 [8:30:06<1:13:49, 80.54s/it]
{'loss': 0.2584, 'learning_rate': 4.777080156180637e-05, 'epoch': 8.71} |
|
87%|βββββββββ | 376/430 [8:31:26<1:12:29, 80.54s/it]
88%|βββββββββ | 377/430 [8:32:47<1:11:07, 80.53s/it]
88%|βββββββββ | 378/430 [8:34:07<1:09:47, 80.53s/it]
88%|βββββββββ | 379/430 [8:35:28<1:08:26, 80.52s/it]
88%|βββββββββ | 380/430 [8:36:48<1:07:07, 80.55s/it]
{'loss': 0.2578, 'learning_rate': 4.7876249462122306e-05, 'epoch': 8.83} |
|
89%|βββββββββ | 381/430 [8:38:09<1:05:47, 80.56s/it]
89%|βββββββββ | 382/430 [8:39:29<1:04:25, 80.53s/it]
89%|βββββββββ | 383/430 [8:40:50<1:03:04, 80.53s/it]
89%|βββββββββ | 384/430 [8:42:10<1:01:43, 80.51s/it]
90%|βββββββββ | 385/430 [8:43:31<1:00:22, 80.49s/it]
{'loss': 0.2495, 'learning_rate': 4.798033318571224e-05, 'epoch': 8.94} |
|
90%|βββββββββ | 386/430 [8:44:52<59:06, 80.61s/it]
90%|βββββββββ | 387/430 [8:46:12<57:43, 80.55s/it]
90%|βββββββββ | 388/430 [8:48:22<1:06:42, 95.31s/it]
90%|βββββββββ | 389/430 [8:49:42<1:02:05, 90.85s/it]
91%|βββββββββ | 390/430 [8:51:03<58:29, 87.73s/it]
{'loss': 0.2808, 'learning_rate': 4.810348191078279e-05, 'epoch': 9.07} |
|
91%|βββββββββ | 391/430 [8:52:23<55:37, 85.58s/it]
91%|βββββββββ | 392/430 [8:53:44<53:14, 84.05s/it]
91%|ββββββββββ| 393/430 [8:55:04<51:10, 82.98s/it]
92%|ββββββββββ| 394/430 [8:56:25<49:25, 82.39s/it]
92%|ββββββββββ| 395/430 [8:57:46<47:43, 81.82s/it]
{'loss': 0.2463, 'learning_rate': 4.82046852530342e-05, 'epoch': 9.18} |
|
92%|ββββββββββ| 396/430 [8:59:06<46:09, 81.45s/it]
92%|ββββββββββ| 397/430 [9:00:27<44:37, 81.15s/it]
93%|ββββββββββ| 398/430 [9:01:47<43:09, 80.93s/it]
93%|ββββββββββ| 399/430 [9:03:08<41:44, 80.80s/it]
93%|ββββββββββ| 400/430 [9:04:28<40:20, 80.70s/it]
{'loss': 0.2418, 'learning_rate': 4.830463137837162e-05, 'epoch': 9.3} |
|
93%|ββββββββββ| 401/430 [9:05:49<38:58, 80.65s/it]
93%|ββββββββββ| 402/430 [9:07:09<37:38, 80.65s/it]
94%|ββββββββββ| 403/430 [9:08:30<36:16, 80.59s/it]
94%|ββββββββββ| 404/430 [9:09:50<34:54, 80.56s/it]
94%|ββββββββββ| 405/430 [9:11:11<33:33, 80.54s/it]
{'loss': 0.2451, 'learning_rate': 4.8403351139919656e-05, 'epoch': 9.41} |
|
94%|ββββββββββ| 406/430 [9:12:32<32:14, 80.61s/it]
95%|ββββββββββ| 407/430 [9:13:52<30:52, 80.56s/it]
95%|ββββββββββ| 408/430 [9:15:12<29:31, 80.54s/it]
95%|ββββββββββ| 409/430 [9:16:33<28:11, 80.54s/it]
95%|ββββββββββ| 410/430 [9:17:53<26:50, 80.51s/it]
{'loss': 0.2403, 'learning_rate': 4.850087426881512e-05, 'epoch': 9.53} |
|
96%|ββββββββββ| 411/430 [9:19:15<25:33, 80.70s/it]
96%|ββββββββββ| 412/430 [9:20:35<24:11, 80.65s/it]
96%|ββββββββββ| 413/430 [9:21:56<22:50, 80.61s/it]
96%|ββββββββββ| 414/430 [9:23:16<21:28, 80.56s/it]
97%|ββββββββββ| 415/430 [9:24:37<20:09, 80.62s/it]
{'loss': 0.2411, 'learning_rate': 4.859722942795827e-05, 'epoch': 9.64} |
|
97%|ββββββββββ| 416/430 [9:25:57<18:48, 80.58s/it]
97%|ββββββββββ| 417/430 [9:27:18<17:27, 80.54s/it]
97%|ββββββββββ| 418/430 [9:28:38<16:06, 80.54s/it]
97%|ββββββββββ| 419/430 [9:29:59<14:45, 80.51s/it]
98%|ββββββββββ| 420/430 [9:31:20<13:26, 80.65s/it]
{'loss': 0.2373, 'learning_rate': 4.8692444262583224e-05, 'epoch': 9.76} |
|
98%|ββββββββββ| 421/430 [9:32:40<12:05, 80.59s/it]
98%|ββββββββββ| 422/430 [9:34:01<10:44, 80.56s/it]
98%|ββββββββββ| 423/430 [9:35:21<09:23, 80.53s/it]
99%|ββββββββββ| 424/430 [9:36:42<08:03, 80.54s/it]
99%|ββββββββββ| 425/430 [9:38:02<06:42, 80.52s/it]
{'loss': 0.2356, 'learning_rate': 4.8786545447870833e-05, 'epoch': 9.87} |
|
99%|ββββββββββ| 426/430 [9:39:23<05:22, 80.52s/it]
99%|ββββββββββ| 427/430 [9:40:43<04:01, 80.52s/it]
100%|ββββββββββ| 428/430 [9:42:04<02:41, 80.53s/it]
100%|ββββββββββ| 429/430 [9:43:24<01:20, 80.55s/it]
100%|ββββββββββ| 430/430 [9:44:45<00:00, 80.58s/it]
{'loss': 0.2362, 'learning_rate': 4.8879558733809264e-05, 'epoch': 9.99} |
|
|
|
|
Training completed. Do not forget to share your model on huggingface.co/models =) |
|
|
|
|
|
Time: 35085.85 |
|
Time: 35085.80 |
|
Time: 35085.78 |
|
Time: 35085.88 |
|
Time: 35085.79 |
|
Time: 35085.76 |
|
Time: 35085.49 |
|
Samples/second: 33.41 |
|
{'train_runtime': 35085.4942, 'train_samples_per_second': 33.413, 'train_steps_per_second': 0.012, 'train_loss': 0.412373732966046, 'epoch': 9.99} |
|
100%|ββββββββββ| 430/430 [9:44:45<00:00, 81.59s/it] |
|
GPU memory occupied: 43825 MB. |
|
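Each of the seven ranks prints its own end-of-training summary, which is why seven Time values appear above. A sketch of the kind of helper that would produce exactly these lines; the function names and the pynvml-based memory query are assumptions, since tune_gpt.py itself is not shown in this log:

    from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

    def print_gpu_utilization(device_index=0):
        # Query used device memory via NVML and report it in MB.
        nvmlInit()
        handle = nvmlDeviceGetHandleByIndex(device_index)
        info = nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU memory occupied: {info.used // 1024**2} MB.")

    def print_summary(result):
        # result is the TrainOutput returned by trainer.train().
        print(f"Time: {result.metrics['train_runtime']:.2f}")
        print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
        print_gpu_utilization()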
Configuration saved in experiments/2023-02-21-b0010c97cb1f06debca911602ea05b6ff85a8270fb9487d27b3d52eb4eb29e9e/final_checkpoint/config.json |
|
Model weights saved in experiments/2023-02-21-b0010c97cb1f06debca911602ea05b6ff85a8270fb9487d27b3d52eb4eb29e9e/final_checkpoint/pytorch_model.bin |
|
tokenizer config file saved in experiments/2023-02-21-b0010c97cb1f06debca911602ea05b6ff85a8270fb9487d27b3d52eb4eb29e9e/final_checkpoint/tokenizer/tokenizer_config.json |
|
Special tokens file saved in experiments/2023-02-21-b0010c97cb1f06debca911602ea05b6ff85a8270fb9487d27b3d52eb4eb29e9e/final_checkpoint/tokenizer/special_tokens_map.json |
|
Saving model checkpoint to experiments/2023-02-21-b0010c97cb1f06debca911602ea05b6ff85a8270fb9487d27b3d52eb4eb29e9e/trainer_final_checkpoint |
|
Configuration saved in experiments/2023-02-21-b0010c97cb1f06debca911602ea05b6ff85a8270fb9487d27b3d52eb4eb29e9e/trainer_final_checkpoint/config.json |
|
Model weights saved in experiments/2023-02-21-b0010c97cb1f06debca911602ea05b6ff85a8270fb9487d27b3d52eb4eb29e9e/trainer_final_checkpoint/pytorch_model.bin |
|
tokenizer config file saved in experiments/2023-02-21-b0010c97cb1f06debca911602ea05b6ff85a8270fb9487d27b3d52eb4eb29e9e/trainer_final_checkpoint/tokenizer_config.json |
|
Special tokens file saved in experiments/2023-02-21-b0010c97cb1f06debca911602ea05b6ff85a8270fb9487d27b3d52eb4eb29e9e/trainer_final_checkpoint/special_tokens_map.json |
|
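The two save blocks above correspond to a manual save_pretrained pass (final_checkpoint, with the tokenizer written to its own subdirectory) followed by a Trainer-managed save (trainer_final_checkpoint). A sketch of what the script presumably does here; the variable names are assumptions:

    import os

    # Manual save: config.json + pytorch_model.bin, with the tokenizer files
    # in a "tokenizer" subdirectory, matching the final_checkpoint paths above.
    final_dir = os.path.join(experiment_dir, "final_checkpoint")  # experiment_dir assumed
    model.save_pretrained(final_dir)
    tokenizer.save_pretrained(os.path.join(final_dir, "tokenizer"))

    # Trainer-managed save, matching the trainer_final_checkpoint paths above.
    trainer.save_model(os.path.join(experiment_dir, "trainer_final_checkpoint"))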
Traceback (most recent call last): |
|
File "tune_gpt.py", line 227, in <module> |
|
trainer.save_state(trainer_save_dir) |
|
TypeError: save_state() takes 1 positional argument but 2 were given |
|
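This is the only failure in the run: training completed and every checkpoint was written before the TypeError. Trainer.save_state() takes no path argument; it writes trainer_state.json into args.output_dir. A sketch of one possible fix, with trainer_save_dir taken from the traceback (redirecting output_dir is a workaround, not necessarily the author's intended design):

    # Before (tune_gpt.py line 227, per the traceback):
    #     trainer.save_state(trainer_save_dir)  # TypeError: takes 1 positional argument
    #
    # After: save_state() accepts no path, so point output_dir at the
    # target directory before calling it.
    trainer.args.output_dir = trainer_save_dir
    trainer.save_state()  # writes trainer_state.json into trainer.args.output_dir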
[2023-02-21 02:51:15,357] [INFO] [launch.py:350:main] Process 31459 exits successfully. |
|
[2023-02-21 02:51:15,358] [INFO] [launch.py:350:main] Process 31463 exits successfully. |
|
[2023-02-21 02:51:16,360] [INFO] [launch.py:350:main] Process 31486 exits successfully. |
|
[2023-02-21 02:51:16,360] [INFO] [launch.py:350:main] Process 31471 exits successfully. |
|
[2023-02-21 02:51:16,360] [INFO] [launch.py:350:main] Process 31478 exits successfully. |
|
[2023-02-21 02:51:16,361] [INFO] [launch.py:350:main] Process 31490 exits successfully. |
|
[2023-02-21 02:51:17,362] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 31458 |
|
[2023-02-21 02:51:17,363] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 31459 |
|
[2023-02-21 02:51:17,363] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 31463 |
|
[2023-02-21 02:51:17,363] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 31471 |
|
[2023-02-21 02:51:17,363] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 31478 |
|
[2023-02-21 02:51:17,363] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 31486 |
|
[2023-02-21 02:51:17,363] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 31490 |
|
[2023-02-21 02:51:17,364] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python3', '-u', 'tune_gpt.py', '--local_rank=6', '--deepspeed', 'deepspeed.json', '--upload-experiment'] exits with return code = 1 |
|
|