nohup: ignoring input [2022-12-18 10:53:56,268] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2022-12-18 10:53:56,292] [INFO] [runner.py:508:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 tune_gpt.py --deepspeed deepspeed.json --upload-model [2022-12-18 10:53:57,962] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]} [2022-12-18 10:53:57,962] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0 [2022-12-18 10:53:57,962] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(, {'localhost': [0, 1]}) [2022-12-18 10:53:57,962] [INFO] [launch.py:162:main] dist_world_size=2 [2022-12-18 10:53:57,962] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. No config specified, defaulting to: apps/all No config specified, defaulting to: apps/all Found cached dataset apps (/home/user/.cache/huggingface/datasets/codeparrot___apps/all/0.0.0/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5) Found cached dataset apps (/home/user/.cache/huggingface/datasets/codeparrot___apps/all/0.0.0/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5) Max length: 2048 Max length: 2048 PyTorch: setting up devices PyTorch: setting up devices [2022-12-18 10:54:11,976] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-). The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-). GPU memory occupied: 3404 MB. GPU memory occupied: 3404 MB. Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root... Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/user/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 2.6207056045532227 seconds Loading extension module cpu_adam... Time to load cpu_adam op: 2.6935393810272217 seconds Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root... Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root... Emitting ninja build file /home/user/.cache/torch_extensions/py38_cu116/utils/build.ninja... Building extension module utils... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module utils... Time to load utils op: 0.31526732444763184 seconds Loading extension module utils... Time to load utils op: 0.3027935028076172 seconds Rank: 0 partition count [2] and sizes[(62600064, False)] Rank: 1 partition count [2] and sizes[(62600064, False)] Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.0005586147308349609 seconds Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.00031638145446777344 seconds 0%| | 0/48845 [00:00 shutil.move(os.path.join(pwd_path, "output.log"), os.path.join(final_save_dir)) File "/usr/lib/python3.8/shutil.py", line 789, in move raise Error("Destination path '%s' already exists" % real_dst) shutil.Error: Destination path 'experiments/2022-12-19-fdf21cd1874b02afe17fee417ba59c79dadadd87f9b5944402c89d476acb4861/output.log' already exists [2022-12-19 04:13:54,430] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2056 [2022-12-19 04:13:54,430] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2057 [2022-12-19 04:13:55,051] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'tune_gpt.py', '--local_rank=1', '--deepspeed', 'deepspeed.json', '--upload-model'] exits with return code = 1