nohup: ignoring input
[2022-12-19 10:45:52,072] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-12-19 10:45:52,087] [INFO] [runner.py:508:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 tune_gpt.py --deepspeed deepspeed.json --limit=10 --local_rank=-1
[2022-12-19 10:45:53,665] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2022-12-19 10:45:53,665] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2022-12-19 10:45:53,665] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2022-12-19 10:45:53,665] [INFO] [launch.py:162:main] dist_world_size=2
[2022-12-19 10:45:53,665] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
No config specified, defaulting to: apps/all
Found cached dataset apps (/home/user/.cache/huggingface/datasets/codeparrot___apps/all/0.0.0/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5)
Max length: 2048
PyTorch: setting up devices
[2022-12-19 10:46:03,625] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
No config specified, defaulting to: apps/all
Found cached dataset apps (/home/user/.cache/huggingface/datasets/codeparrot___apps/all/0.0.0/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5)
Max length: 2048
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
GPU memory occupied: 3108 MB.
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
GPU memory occupied: 3108 MB.
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/user/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6374411582946777 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.604843854904175 seconds
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Emitting ninja build file /home/user/.cache/torch_extensions/py38_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.2902805805206299 seconds
Loading extension module utils...
Time to load utils op: 0.2025001049041748 seconds
Rank: 0 partition count [2] and sizes[(62599296, False)]
Rank: 1 partition count [2] and sizes[(62599296, False)]
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0009868144989013672 seconds
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
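The `cpu_adam` extension build and the per-rank partition counts above are what DeepSpeed emits when ZeRO optimizer-state sharding with CPU offload is enabled. The actual `deepspeed.json` is not shown in this log; as a hedged sketch only, a minimal config that would produce this behavior under the HuggingFace Trainer (all `"auto"` values deferred to the Trainer's arguments) might look like:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto"
    }
  }
}
```

Offloading the optimizer to CPU is what triggers the `cpu_adam` ninja build; with `stage` ≥ 1, optimizer states are partitioned across the two ranks, matching the `partition count [2]` lines.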
Time to load utils op: 0.0003800392150878906 seconds
  0%|          | 0/55 [00:00
Traceback (most recent call last):
    shutil.move(os.path.join(pwd_path, "output.log"), os.path.join(final_save_dir))
  File "/usr/lib/python3.8/shutil.py", line 789, in move
    raise Error("Destination path '%s' already exists" % real_dst)
shutil.Error: Destination path 'experiments/2022-12-19-ab8f3a39c84fea7f66bf71860384bbce5df5fb3523e7dabd22b35c3ecfefb154/output.log' already exists
[2022-12-19 10:47:33,785] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 144520
[2022-12-19 10:47:33,786] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 144521
[2022-12-19 10:47:33,793] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'tune_gpt.py', '--local_rank=1', '--deepspeed', 'deepspeed.json', '--limit=10', '--local_rank=-1'] exits with return code = 1
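The run died because `shutil.move` refuses to overwrite an existing destination, and with two DeepSpeed ranks both executing the save path in `tune_gpt.py`, the second rank will always find the `output.log` that the first rank already moved. A minimal sketch of a fix, assuming the save code can be wrapped in a helper (`move_overwrite` is a hypothetical name, not a function from `tune_gpt.py`):

```python
import os
import shutil


def move_overwrite(src: str, dst_dir: str) -> str:
    """Move src into dst_dir, replacing any same-named file already there.

    shutil.move raises shutil.Error when the destination file exists,
    which is the crash seen in the log; removing the stale copy first
    avoids it. Returns the final destination path.
    """
    target = os.path.join(dst_dir, os.path.basename(src))
    if os.path.exists(target):
        os.remove(target)
    return shutil.move(src, target)


# In a multi-process launch, only one rank should touch shared files,
# e.g. (hypothetical guard, matching the launcher's LOCAL_RANK env var):
# if int(os.environ.get("LOCAL_RANK", "0")) == 0:
#     move_overwrite("output.log", final_save_dir)
```

Guarding file moves on the local rank also explains why only the `--local_rank=1` subprocess exits with code 1 in the final launcher line: rank 0's move succeeded, rank 1's collided.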